Abstract: Penalized regression methods are popular when the number p 
of covariates exceeds the number n of samples, and are well studied. In 
many applications where p >> n problems often occur, like in genomics, 
covariates are subject to measurement error. We study the impact of mea- 
surement error on linear regression with the Lasso penalty. Bounds for 
estimation and prediction error are derived, as well as a beta-min condition 
for covariate screening, showing explicitly how measurement error affects 
the statistical performance of the Lasso. Finite sample conditions for sign 
consistent covariate selection by the Lasso with measurement error are also 
derived. A simple method of correction for measurement error in the Lasso 
is then considered. Finite sample conditions for sign consistent covariate se- 
lection with this corrected Lasso are compared to the uncorrected case. In 
the large sample limit, the corrected Lasso yields sign consistent covariate 
selection under conditions very similar to the Lasso with perfect measure- 
ments, whereas the uncorrected Lasso requires much more stringent condi- 
tions on the covariance structure of the data. Finally, we suggest methods 
to correct for measurement error in generalized linear models with Lasso 
penalty, illustrated for the case of logistic and Poisson regression. The cor- 
rection methods are computationally efficient in the high-dimensional case, 
and are studied further in simulation experiments. 

Keywords and phrases: generalized linear model, high-dimensional data, 
Lasso, measurement error, sparse regression. 



1. Introduction 

Due to rapid technological progress, complex, high-dimensional data sets are 
now commonplace in a range of fields, e.g., genomics (Boulesteix et al., 2008; 
Clarke et al., 2008; Li and Xu, 2009) and finance (Fan et al., 2011). In regres- 
sion problems, the traditional least squares approach breaks down when the 
parameters outnumber the sample size (p > n). Various penalization schemes 
have been proposed, which overcome this by shrinking the parameter space, 
including the Dantzig Selector (Candes and Tao, 2007), the Lasso (Tibshirani, 
1996), Ridge regression (Hoerl and Kennard, 1970) and the SCAD (Fan and Li, 
2001). The Lasso has been extensively used in applied problems (e.g., Chatter- 
jee et al. (2011); Zou et al. (2011)), and its statistical scope and limitations are 
well understood (e.g., Bickel et al. (2009); Bunea (2008); Knight and Fu (2000); 
Meinshausen and Bühlmann (2006); Zhao and Yu (2006) and the recent book 
Bühlmann and van de Geer (2011)). A common assumption is sparsity, i.e., 
only a small number of covariates influence the outcome. Several refinements 
have been proposed, in particular the adaptive Lasso (Zou, 2006), which relaxes 
the rather strict conditions required for consistent covariate selection (sparsity 
recovery) of the standard Lasso. 

Measurement error in the covariates is a problem in various high-dimensional 
data sets. In genomics, examples include gene expression microarray data (Hein 
et al., 2005; Purdom and Holmes, 2005; Rocke and Durbin, 2001) and high- 
throughput sequencing (Benjamini and Speed, 2012). In classical regression 
models, measurement error is known to cause biased parameter estimates and 
lack of power (Buonaccorsi, 2010; Carroll et al., 2006). Measurement error in 
SCAD regression has been studied, though with focus on the low-dimensional 
regime (Liang and Li, 2009; Ma and Li, 2010; Xu and Yu, 2007; You et al., 
2008; Zhou et al., 2011). For high-dimensional data, Rosenbaum and Tsybakov 
(2010) introduced the Matrix Uncertainty (MU) Selector, a modification of the 
Dantzig Selector, which handles measurement error and missing data. Through 
analytical results for the finite sample case, the MU Selector is shown to give 
good parameter estimation and covariate selection. An improved MU Selector 
is presented in Rosenbaum and Tsybakov (2011). Loh and Wainwright (2012) 
consider generalized M-estimators with Lasso regularization, of which special 
cases include correction for additive measurement error or missing data. The 
method is shown to yield estimates close to the true parameters, as measured 
in the $\ell_1$- or $\ell_2$-norm, and is computationally feasible in the high-dimensional 
case, despite its non-convexity. We also mention Chen and Caramanis (2012), 
who consider high-dimensional measurement error problems with independent 
columns, and develop a modified Orthogonal Matching Pursuit algorithm yield- 
ing correct sparsity recovery with high probability, also when estimates of the 
measurement error do not exist. 

In the first half of this paper, we study the effect of measurement error in 
the Lasso. In particular, we ask the question: Under which conditions can the 
standard Lasso (naive approach) be safely used, and when are correction meth- 
ods required? For a linear model with additive measurement error, we derive 
finite sample guarantees for the prediction and estimation error of the naive 
Lasso. The results show explicitly how measurement error increases the upper 
bounds on the estimation and prediction error. A beta-min condition for covari- 
ate screening is also derived, showing explicitly how the minimum coefficient 
magnitude warranting a non-zero estimate, increases in the measurement error. 
We also show that the naive Lasso yields asymptotically sign consistent covari- 
ate selection under very stringent conditions on the noise. Next, a correction of 
the Lasso loss function for linear models is considered, which compensates for 
measurement error in the covariates. The correction has been studied earlier by 
Loh and Wainwright (2012) for the Lasso, and Liang and Li (2009) for SCAD 
penalization. In the finite sample case, we derive conditions under which this cor- 
rected Lasso yields sign consistent covariate selection, and show that it performs 
asymptotically as well as the Lasso without measurement error. We then go on 
to consider the Lasso for Generalized Linear Models (GLMs), and point out 
ways to correct for additive measurement error in GLMs, using the conditional 
score method by Stefanski and Carroll (1987) together with an efficient projec- 
tion algorithm (Duchi et al., 2008) suggested by Loh and Wainwright (2012) for 
linear models with measurement error or missing data. Finally, the analytical 
results are illustrated through simulations. Proofs and additional conditions are 
given in the appendix. 

2. Model Setup 

In Sections 3 and 4, we consider a linear regression model with additive mea- 
surement error, 

$$y = X\beta^0 + \epsilon \qquad (1)$$
$$W = X + U, \qquad (2)$$

with observations of $p$ covariates and a continuous response $y \in \mathbb{R}^n$ on $n$ 
individuals. In Section 5, the linear model (1) will be replaced by GLMs, but 
additive measurement error of the form (2) will still be assumed. The rows of the 
true design matrix $X \in \mathbb{R}^{n \times p}$ and the matrix of measurement errors 
$U \in \mathbb{R}^{n \times p}$ are assumed i.i.d. according to $N(0, \Sigma_{XX})$ and 
$N(0, \Sigma_{UU})$, respectively. Thus, the rows of the measurements $W = X + U$ are also 
i.i.d. normal, with mean zero and covariance matrix 
$\Sigma_{WW} = \Sigma_{XX} + \Sigma_{UU}$. The model errors are i.i.d. according to 
$\epsilon_i \sim N(0, \sigma^2)$, for $i = 1, \ldots, n$. Detailed assumptions on the 
covariance matrices will be specified later. Let 

$$L_0 = \{ j : \beta_j^0 \neq 0 \}$$

be the index set of non-zero components of the true coefficient vector $\beta^0 \in \mathbb{R}^p$, 
and denote the number of relevant covariates by $l_0 = \mathrm{card}\{L_0\}$. Under the 
sparsity assumption, most components of $\beta^0$ are zero, such that $l_0 < n < p$. 
Direct use of error-prone measurements is referred to as the naive approach in 
the measurement error literature, and the naive Lasso for a linear model takes 
the form 

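$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left\{ \|y - W\beta\|_2^2 + \lambda \|\beta\|_1 \right\}, \qquad (3)$$

where $\lambda > 0$ is a regularization parameter. For any $\lambda > 0$, define the active set 
of the Lasso, 

$$\hat{L}(\lambda) = \{ j : \hat{\beta}_j(\lambda) \neq 0 \}.$$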

Given $\beta^0$, we order the covariates such that $L_0 = \{1, \ldots, l_0\}$ and 
$L_0^c = \{l_0 + 1, \ldots, p\}$. Furthermore, we introduce the partitioning 
$W = (W_1, W_2)$, where $W_1 \in \mathbb{R}^{n \times l_0}$ contains the $n$ measurements of 
the $l_0$ relevant covariates, and $W_2 \in \mathbb{R}^{n \times (p - l_0)}$ contains the $n$ 
measurements of the $(p - l_0)$ irrelevant covariates. The same notation is used for 
$X$ and $U$. Sample covariance matrices are denoted by $S$, and subscripts show which 
covariates are involved. For example, the empirical covariance of the measurements 
is given by $S_{WW} = (1/n) W'W$. Using $S_{WW}$ as an example, we partition the 
covariance matrices in the form 

$$S_{WW} = \begin{pmatrix} S_{WW}^{1,1} & S_{WW}^{1,2} \\ S_{WW}^{2,1} & S_{WW}^{2,2} \end{pmatrix},$$

where $S_{WW}^{1,1} \in \mathbb{R}^{l_0 \times l_0}$ is the covariance of the relevant covariates, 
$S_{WW}^{2,2} \in \mathbb{R}^{(p - l_0) \times (p - l_0)}$ is the covariance of the $p - l_0$ 
irrelevant covariates, and $S_{WW}^{1,2} = (S_{WW}^{2,1})' \in \mathbb{R}^{l_0 \times (p - l_0)}$ 
is the covariance of the relevant covariates with the irrelevant covariates. 
Population covariance matrices are denoted by $\Sigma$, and indexed by subscripts and 
superscripts in the same way as described for the sample covariance matrices. 
Finally, the true coefficient vector is written in the form 
$\beta^0 = ((\beta_1^0)', (\beta_2^0)')'$, where $\beta_1^0 \in \mathbb{R}^{l_0}$ are the non-zero 
coefficients and $\beta_2^0 \in \mathbb{R}^{p - l_0}$ is a vector of zeros. The Lasso estimates 
are divided according to the same pattern, i.e., 
$\hat{\beta}(\lambda) = \hat{\beta} = (\hat{\beta}_1', \hat{\beta}_2')'$, where 
$\hat{\beta}_1 \in \mathbb{R}^{l_0}$, $\hat{\beta}_2 \in \mathbb{R}^{p - l_0}$, 
and the dependence on $\lambda$ is implicit. Note that the elements of $\hat{\beta}_1$ are not 
necessarily non-zero, neither are the elements of $\hat{\beta}_2$ necessarily zero. 
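
The setup above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' code: it draws data from the model (1)-(2) and minimizes the naive Lasso criterion (3) by plain coordinate descent. The function names (`simulate`, `naive_lasso`), the parameter values and the choice of coordinate descent as solver are assumptions made for illustration only.

```python
import numpy as np

def simulate(n, p, beta0, Sigma_xx, Sigma_uu, sigma, rng):
    """Draw (y, W, X) from y = X beta0 + eps, W = X + U with Gaussian rows."""
    X = rng.multivariate_normal(np.zeros(p), Sigma_xx, size=n)
    U = rng.multivariate_normal(np.zeros(p), Sigma_uu, size=n)
    eps = rng.normal(0.0, sigma, size=n)
    return X @ beta0 + eps, X + U, X

def soft_threshold(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def naive_lasso(y, W, lam, n_iter=200):
    """Coordinate descent for min_b ||y - W b||_2^2 + lam * ||b||_1."""
    n, p = W.shape
    b = np.zeros(p)
    col_sq = (W ** 2).sum(axis=0)
    r = y.copy()                       # current residual y - W b
    for _ in range(n_iter):
        for j in range(p):
            r += W[:, j] * b[j]        # remove the contribution of coordinate j
            z = W[:, j] @ r
            b[j] = soft_threshold(z, lam / 2.0) / col_sq[j]
            r -= W[:, j] * b[j]        # add the updated contribution back
    return b

# Illustrative values (assumed): p > n, five relevant covariates.
rng = np.random.default_rng(0)
n, p, l0 = 50, 100, 5
beta0 = np.zeros(p)
beta0[:l0] = 2.0
y, W, X = simulate(n, p, beta0, np.eye(p), 0.25 * np.eye(p), 1.0, rng)
beta_hat = naive_lasso(y, W, lam=40.0)
active = np.flatnonzero(beta_hat)      # estimated active set
```

The factor `lam / 2` in the threshold follows from the unnormalized squared loss in (3); solvers that scale the loss by 1/(2n) need a correspondingly rescaled penalty.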



3. Impact of Measurement Error in Lasso 

Using the error-free case as a reference, we will show in this section how known 
results for prediction, estimation, screening and selection are affected by additive 
measurement error. The main results apply to the finite sample case, with p > n, 
but we will start by pointing out an asymptotic result. 

In the absence of measurement error, the Lasso estimates $\hat{\beta}(\lambda_n)$ converge in 
probability to the true coefficient vector $\beta^0$, when $\lambda_n/n \to 0$ as $n \to \infty$ 
(Knight and Fu, 2000, Theorem 1). When measurement error is present, on the other 
hand, the naive Lasso is asymptotically biased. 

Lemma 1. Assume $\lambda_n/n \to 0$ as $n \to \infty$. Then, as $n \to \infty$, 

$$\hat{\beta}(\lambda_n) \xrightarrow{p} \Sigma_{WW}^{-1} \Sigma_{XX} \beta^0. \qquad (4)$$

Hence, with proper scaling of $\lambda_n$, the bias induced by additive measurement 
error is the same as for a multivariate linear model. Carroll et al. (2006, Section 
3.3.2) demonstrate how this bias might amplify, attenuate, and even change the 
sign of the estimated regression coefficients. Without measurement error, the 
Lasso is also root-n consistent, given $\lambda_n/\sqrt{n} \to 0$ as $n \to \infty$ (Knight and Fu, 
2000, Theorem 2). Since the naive Lasso is asymptotically biased, it cannot be 
root-n consistent. 
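
To illustrate the limit in Lemma 1, the following sketch evaluates $\Sigma_{WW}^{-1}\Sigma_{XX}\beta^0$ for a two-covariate example. The covariance values are assumptions chosen for illustration; `naive_limit` is a hypothetical helper name.

```python
import numpy as np

def naive_limit(Sigma_xx, Sigma_uu, beta0):
    """Probability limit Sigma_WW^{-1} Sigma_XX beta0 from Lemma 1."""
    Sigma_ww = Sigma_xx + Sigma_uu
    return np.linalg.solve(Sigma_ww, Sigma_xx @ beta0)

# Two correlated covariates; only the first is relevant and only the
# first is measured with error (illustrative values).
Sigma_xx = np.array([[1.0, 0.8], [0.8, 1.0]])
Sigma_uu = np.diag([0.5, 0.0])
beta0 = np.array([1.0, 0.0])
limit = naive_limit(Sigma_xx, Sigma_uu, beta0)

# Proportional error covariance, Sigma_UU = c * Sigma_XX, gives pure attenuation.
c = 0.5
limit_prop = naive_limit(Sigma_xx, c * Sigma_xx, beta0)
```

In the first case the limit is approximately (0.42, 0.47): the relevant coefficient is attenuated, and the irrelevant covariate, measured without error but correlated with the relevant one, receives a non-zero limit. In the proportional case the limit is $\beta^0/(1 + c)$.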



3.1. Prediction, Estimation and Covariate Screening 

We now turn to the case of a finite sample with p > n. In particular, we consider 
prediction error $\|W(\hat{\beta} - \beta^0)\|_2^2$, estimation error in the $\ell_1$-norm 
$\|\hat{\beta} - \beta^0\|_1$, and the covariate screening property $L_0 \subseteq \hat{L}(\lambda)$. 
In our setting, there are two noise terms to consider: the model error $\epsilon$ and the 
measurement error $U$. We show analytically how conditions which guarantee 
successful prediction and estimation depend on an upper bound of these noise 
terms. The direct impact of measurement error thus becomes clear. For 
corresponding results with perfect measurements, cf. Bühlmann and van de Geer 
(2011, Chapters 6 and 7) and the references cited therein. 

Definition 1. The compatibility condition holds for the index set $L_0$ if, for 
some $\phi > 0$ and all $\beta$ such that $\|\beta_2\|_1 \le 3\|\beta_1\|_1$, it holds that 

$$\|\beta_1\|_1^2 \le \frac{l_0 \|W\beta\|_2^2}{\phi^2}. \qquad (5)$$



Note that (Bühlmann and van de Geer, 2011, Lemma 6.18) 

$$\|W\beta\|_2^2 = \beta'W'W\beta = \beta'(X'X + 2X'U + U'U)\beta \ge \|X\beta\|_2^2.$$

Hence, if the compatibility condition holds with a constant $\phi^*$ for the unobserved 
true covariates $X$, then it also holds for the measurements $W$, but with $\phi \ge \phi^*$. 
Compatibility conditions like this hold with high probability for a broad class 
of Gaussian designs (Raskutti et al., 2010), and thus also hold with at least 
the same probability when the covariates are subject to additive measurement 
error. 

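Theorem 1. Assume the compatibility condition holds for $L_0$ with a constant 
$\phi$, and that there exists a finite constant $\lambda_0$, such that 

$$2\|(\epsilon - U\beta^0)'W\|_\infty \le \lambda_0. \qquad (6)$$

Then, with a regularization parameter $\lambda \ge 2\lambda_0$, the following prediction and 
estimation bound holds for the naive Lasso: 

$$\|W(\hat{\beta} - \beta^0)\|_2^2 + \lambda\|\hat{\beta} - \beta^0\|_1 \le \frac{4\lambda^2 l_0}{\phi^2}.$$

Next, if $|\beta_j^0| > 4\lambda l_0/\phi^2$ for $j \in L_0$, we have the screening result 

$$L_0 \subseteq \hat{L}(\lambda).$$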

A consequence of Theorem 1 is the prediction bound 

$$\|W(\hat{\beta} - \beta^0)\|_2^2 \le \frac{4\lambda^2 l_0}{\phi^2}$$

and the estimation bound 

$$\|\hat{\beta} - \beta^0\|_1 \le \frac{4\lambda l_0}{\phi^2}.$$

The constant $\lambda_0$ can be interpreted as an upper bound on the total noise. By 
choosing $\lambda \ge 2\lambda_0$, this noise is ruled out by the regularization. The noise bound 
(6) holds with high probability when $\lambda_0$ is chosen sufficiently large. In the 
absence of measurement error, (6) reduces to 

$$2\|\epsilon'X\|_\infty \le \lambda_0. \qquad (7)$$

Since the left-hand side of (6) in general is larger than the left-hand side of (7), 
measurement error increases the necessary regularization for prediction, 
estimation and screening. The upper bounds on prediction and estimation increase as 
$\lambda^2$ and $\lambda$, respectively, but are possibly eased by a larger compatibility constant. 
The lower bound for covariate screening increases with $\lambda$ in the same manner. 
This is intuitively clear: when the amount of noise increases, the coefficient 
magnitude required for detection also increases. Covariate screening is important in 
multi-stage procedures, like the adaptive Lasso (Zou, 2006). It guarantees that 
if the relevant covariates have sufficiently large regression coefficients, 

$$|\beta_j^0| > \frac{4\lambda l_0}{\phi^2}, \quad \text{for } j \in L_0,$$

then, in the first step, they are selected with probability 

$$P\left(L_0 \subseteq \hat{L}(\lambda)\right) \ge P\left(2\|(\epsilon - U\beta^0)'W\|_\infty \le \lambda_0\right),$$

whenever $\lambda \ge 2\lambda_0$. 
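
As a rough numerical companion to Theorem 1, the sketch below evaluates the left-hand side of the noise bound (6) on simulated data, with and without measurement error, and computes the bounds $4\lambda^2 l_0/\phi^2$ and $4\lambda l_0/\phi^2$ for given constants. All numerical values and the helper names `noise_level` and `theorem1_bounds` are assumptions for illustration; the compatibility constant $\phi$ is simply supplied by the user, since it is not estimated here.

```python
import numpy as np

def noise_level(W, eps, U, beta0):
    """Left-hand side of (6): 2 * ||(eps - U beta0)' W||_inf."""
    return 2.0 * np.max(np.abs(W.T @ (eps - U @ beta0)))

def theorem1_bounds(lam, l0, phi):
    """Prediction bound, l1 estimation bound and beta-min level of Theorem 1."""
    prediction = 4.0 * lam ** 2 * l0 / phi ** 2
    estimation = 4.0 * lam * l0 / phi ** 2
    beta_min = 4.0 * lam * l0 / phi ** 2
    return prediction, estimation, beta_min

# Illustrative data (assumed values).
rng = np.random.default_rng(0)
n, p, l0 = 50, 100, 5
beta0 = np.zeros(p)
beta0[:l0] = 1.0
X = rng.normal(size=(n, p))
U = rng.normal(scale=0.5, size=(n, p))
eps = rng.normal(size=n)

lhs_with_error = noise_level(X + U, eps, U, beta0)            # left-hand side of (6)
lhs_no_error = noise_level(X, eps, np.zeros((n, p)), beta0)   # reduces to (7)
```

Comparing `lhs_with_error` and `lhs_no_error` over repeated draws gives an empirical impression of how much additional regularization the measurement error calls for.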

3.2. Covariate Selection 

We now consider exact recovery of the sign pattern of $\beta^0$, which is an important 
goal, e.g., in high-throughput genomics. In the absence of measurement error, 
such sign consistent covariate selection requires an Irrepresentable Condition 
(IC) (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006). In the presence 
of measurement error, the IC has a new form: 

Definition 2. The IC with Measurement Error (IC-ME) holds if there exists a 
constant $\theta \in [0, 1)$ such that 

$$\|S_{WW}^{2,1}(S_{WW}^{1,1})^{-1}\mathrm{sign}(\beta_1^0)\|_\infty \le \theta.$$

The IC-ME requires that the measurements of the relevant covariates and the 
irrelevant covariates have limited correlation. Since the sparsity pattern of $\beta^0$ 
is a priori unknown, the IC-ME is not testable. Replacing the subscripts W by 
X yields the $\theta$-uniform IC in the absence of measurement error. In the noiseless 
case ($\epsilon = 0$ and $U = 0$), the $\theta$-uniform IC is a sufficient condition for sign 
consistent covariate selection, whereas a slightly weaker version, with $\theta = 1$, is a 
necessary condition (Bühlmann and van de Geer, 2011, Theorem 7.1). In order 
to achieve sign consistent covariate selection in the presence of measurement 
error, an additional condition is required: 
Definition 3. The Measurement Error Condition (MEC) is satisfied if 

$$\Sigma_{WW}^{2,1}(\Sigma_{WW}^{1,1})^{-1}\Sigma_{UU}^{1,1} - \Sigma_{UU}^{2,1} = 0.$$

Note that the MEC applies to population covariance matrices, whereas the 
IC-ME applies to sample covariance matrices. As the following result shows, 
the IC-ME is used to obtain a positive lower bound on the probability of sign 
consistent covariate selection in the finite sample, p > n case. The MEC, 
together with other conditions, is sufficient to obtain sign consistent selection with 
probability approaching one in the large sample limit. 

Theorem 2. Assume the IC-ME holds with constant $\theta$. Then 

$$P\left(\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta^0)\right) \ge P(A \cap B),$$

for the events 
and 

< -^(l-^)ll. 

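If, in addition, the MEC is satisfied and 
$|\beta_j^0| > |((\Sigma_{WW}^{1,1})^{-1}\Sigma_{UU}^{1,1}\beta_1^0)_j|$ for $j \in L_0$, then 

$$P\left(\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta^0)\right) = 1 - o(\exp(-n^c)), \quad \text{for some } c \in [0, 1),$$

if $\lambda_n/n \to 0$ and $\lambda_n/n^{(1+c)/2} \to \infty$ as $n \to \infty$. 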

As shown in the proof, event A implies that the relevant covariates are 
estimated with correct sign. Given A, event B implies that the coefficients of the 
irrelevant covariates are correctly set to zero. The left-hand sides of both 
inequalities in the definitions of the events involve noise terms. In order for $A \cap B$ 
to be true, these noise terms must be bounded by the terms on the right-hand 
sides. In event A, there is also a term $(S_{WW}^{1,1})^{-1}$ on the right-hand side. Here, 
$\lambda$ must be chosen small enough so that the upper bound becomes positive. On 
the other hand, the probability of event B increases with $\lambda$. Hence, events A 
and B clearly illustrate the trade-off between choosing $\lambda$ small enough to detect 
all the relevant covariates (increasing $P(A)$), and large enough to discard the 
irrelevant covariates (increasing $P(B)$). 

The MEC holds when $\Sigma_{XX}^{2,1} = \Sigma_{UU}^{2,1} = 0$. Unfortunately, no correlation 
between the relevant and the irrelevant covariates is an unlikely situation in most 
applications. The MEC also holds whenever the population covariance matrix of 
the measurement error has the same form as the population covariance matrix 
of the true covariates, i.e., $\Sigma_{UU} = c\Sigma_{XX}$ for some constant $c$. The right-hand 
side of the condition 

$$|\beta_j^0| > |((\Sigma_{WW}^{1,1})^{-1}\Sigma_{UU}^{1,1}\beta_1^0)_j|,$$

can be interpreted as a perturbation of $\beta_j^0$, and is equal in magnitude to the 
bias which would result from naive least squares estimation of $\beta_1^0$, using only 
measurements of the $l_0$ relevant covariates. This perturbation of $\beta_j^0$, $j \in L_0$, is 
thus not allowed to exceed the magnitude of $\beta_j^0$ itself. 
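
The two conditions can be checked numerically when population covariance matrices are known, as in a simulation study. The sketch below is an illustration under an assumption: it takes the MEC to be the requirement that $\Sigma_{WW}^{2,1}(\Sigma_{WW}^{1,1})^{-1}\Sigma_{UU}^{1,1} - \Sigma_{UU}^{2,1}$ vanishes, which is the form consistent with the two special cases mentioned above (uncorrelated blocks, and $\Sigma_{UU} = c\Sigma_{XX}$). The helper names and the example matrix are hypothetical.

```python
import numpy as np

def ic_me_value(Sigma_ww, l0, sign1):
    """|| Sigma_WW^{2,1} (Sigma_WW^{1,1})^{-1} sign(beta_1^0) ||_inf."""
    S11 = Sigma_ww[:l0, :l0]
    S21 = Sigma_ww[l0:, :l0]
    return np.max(np.abs(S21 @ np.linalg.solve(S11, sign1)))

def mec_residual(Sigma_xx, Sigma_uu, l0):
    """Sigma_WW^{2,1} (Sigma_WW^{1,1})^{-1} Sigma_UU^{1,1} - Sigma_UU^{2,1} (assumed MEC form)."""
    Sww = Sigma_xx + Sigma_uu
    A = Sww[l0:, :l0] @ np.linalg.inv(Sww[:l0, :l0])
    return A @ Sigma_uu[:l0, :l0] - Sigma_uu[l0:, :l0]

# Three covariates, the first two relevant (illustrative values).
l0 = 2
Sigma_xx = np.array([[1.0, 0.5, 0.3],
                     [0.5, 1.0, 0.4],
                     [0.3, 0.4, 1.0]])

res_prop = mec_residual(Sigma_xx, 0.5 * Sigma_xx, l0)    # Sigma_UU proportional to Sigma_XX
res_diag = mec_residual(Sigma_xx, 0.5 * np.eye(3), l0)   # independent measurement errors
theta = ic_me_value(Sigma_xx, l0, np.ones(l0))           # IC value without measurement error
```

Under this reading, `res_prop` is zero up to rounding, whereas independent measurement errors on correlated covariates (`res_diag`) already violate the condition.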

Theorem 2 states that the naive Lasso is asymptotically sign consistent if 
the MEC holds, but this does not, of course, imply that the naive Lasso fails to be 
asymptotically sign consistent when the MEC does not hold. Useful insight into 
necessary conditions for sign consistent covariate selection can be obtained by 
considering the case of no model error, $\epsilon = 0$. In the absence of measurement 
error, the IC is known to be a sharp condition when $\epsilon = 0$: for a finite p > n 
sample, the Lasso will estimate the signs correctly if and only if a version of 
the IC holds (Bühlmann and van de Geer, 2011, Theorem 7.1). Our next result 
states necessary and sufficient conditions for sign consistent covariate selection 
when $\epsilon = 0$, but the covariates are subject to measurement error: 

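Proposition 1. Consider the naive Lasso in the case of no model error, $\epsilon = 0$. 
Define the set of detectable covariates by 

$$L_0^{det} = \left\{ j : |\beta_j^0| > \frac{\lambda}{2n} \sup_{\|\tau_1\|_\infty \le 1} \|(S_{WW}^{1,1})^{-1}\tau_1\|_\infty + |((S_{WW}^{1,1})^{-1} S_{WU}^{1,1}\beta_1^0)_j| \right\}, \qquad (8)$$

for $\tau_1 \in \mathbb{R}^{l_0}$. If the IC-ME is satisfied and 

then $L_0^{det} \subseteq \hat{L}(\lambda) \subseteq L_0$. Conversely, if $\hat{L}(\lambda) = L_0 = L_0^{det}$, then 

$$\left\| S_{WW}^{2,1}(S_{WW}^{1,1})^{-1}\mathrm{sign}(\beta_1^0) + \frac{2n}{\lambda}\left( S_{WW}^{2,1}(S_{WW}^{1,1})^{-1} S_{WU}^{1,1} - S_{WU}^{2,1} \right)\beta_1^0 \right\|_\infty \le 1. \qquad (10)$$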

From (8), it is clear that the regularization level $\lambda$ determines the coefficient 
magnitude required for detection of covariates, in accordance with the screening 
result of Theorem 1. The condition (9) is a sample version of the MEC, so 
the IC-ME and the MEC are sufficient conditions for sign consistent covariate 
selection in the case of no model error. Condition (10), on the other hand, is a 
necessary condition for sign consistent covariate selection. If it does not hold, 
the Lasso will not be a sign consistent covariate selector. The first term on the 
left-hand side of (10) contains the terms in the IC-ME. In the second term on the 
left-hand side, the matrix expression within the parentheses is again a sample 
version of the MEC. Thus, for the naive Lasso to be sign consistent in the $\epsilon = 0$ 
case, a hybrid between the IC-ME and the MEC must hold. Asymptotically, 
sign consistent covariate selection requires $\lambda_n/n \to 0$ as $n \to \infty$. In this case, 
the first term in the beta-min condition of (8) goes to zero, while the $n/\lambda$ term 
in (10) diverges. This suggests that the MEC is essentially a necessary condition 
for sign consistent covariate selection in the large sample limit. If the MEC does 
not hold, the second term on the left-hand side of (10) will diverge. The large 
sample simulations in Section 6.1 support this conclusion, showing that small 
changes of the covariance structure, such that the MEC no longer holds, may result 
in a large number of false positive selections, while relevant covariates are not 
selected. 



3.2.1. The Irrepresentable Condition with Measurement Error (IC-ME) 

When the IC-ME holds for a finite sample with p > n, the probability of sign 
consistent covariate selection with the naive Lasso is bounded from below by a 
positive number, as events A and B in Theorem 2 show. When the constant $\theta$ in 
the IC-ME decreases, $P(B)$ increases while $P(A)$ is unchanged. Thus, the 
probability of false positive selections decreases as $\theta$ decreases. The IC-ME puts 
restrictions on the covariance structure of both the true covariates and the 
measurement error, and its consequences will now be illustrated. We consider the 
IC-ME with population covariance matrices, i.e., 

$$\|\Sigma_{WW}^{2,1}(\Sigma_{WW}^{1,1})^{-1}\mathrm{sign}(\beta_1^0)\|_\infty \le \theta, \quad \theta \in [0, 1).$$

Example 1: Equicovariance Matrices 

Assume $\Sigma_{XX}$ is a positive equicovariance matrix, i.e., $(\Sigma_{XX})_{j,j} = \sigma_x^2$, for 
$j \in \{1, \ldots, p\}$ and $(\Sigma_{XX})_{l,k} = \sigma_x^2\rho_x > 0$, for $l, k \in \{1, \ldots, p : l \neq k\}$. 
Now, the IC holds without measurement error when there exists $\theta^* \in [0, 1)$ such that 

$$\frac{l_0 \rho_x}{1 + (l_0 - 1)\rho_x} \le \theta^*$$

(Zhao and Yu, 2006). Furthermore, let $\Sigma_{UU}$ have the same form, i.e., 
$(\Sigma_{UU})_{j,j} = \sigma_u^2 > 0$ for $j \in \{1, \ldots, p\}$ and $(\Sigma_{UU})_{l,k} = \sigma_u^2\rho_u > 0$, for 
$l, k \in \{1, \ldots, p : l \neq k\}$. We now have the following result. 

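Corollary 1. Assume $\Sigma_{XX}$ and $\Sigma_{UU}$ are positive equicovariance matrices. The 
IC-ME holds if there exists a constant $\theta \in [0, 1)$ such that 

$$\frac{(\sigma_x^2\rho_x + \sigma_u^2\rho_u)\, l_0}{\sigma_x^2 + \sigma_u^2 + (l_0 - 1)(\sigma_x^2\rho_x + \sigma_u^2\rho_u)} \le \theta.$$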

Figure 1 illustrates how the minimum value of $\theta$ in the IC-ME varies with 
the correlation of the true covariates $\rho_x$, measurement error variance $\sigma_u^2$, 
measurement error correlation $\rho_u$ and sparsity index $l_0$. It is assumed that $\sigma_x^2 = 1$. 

Fig 1. Values of $\theta$ in the IC-ME for different variance parameters, cf. Section 3.2.1. 



In the $l_0 = 25$ case (dashed lines), $\theta$ is always larger than when $l_0 = 5$ (full 
lines), except in the case of no correlation (upper left plot, at $\rho_u = 0$). Hence, 
when there are many relevant covariates, it is harder to avoid false positives. It 
is also clear from Figure 1 that when the measurement errors are less correlated 
than the true covariates, the IC-ME holds with a lower $\theta$ than the IC without 
measurement error (red lines). On the other hand, when the measurement 
errors are more correlated than the true covariates, the IC-ME holds with a 
larger $\theta$ than the IC. In the latter case, irrelevant covariates have a high chance 
of being strongly correlated with some relevant covariate, and thus of being selected 
at a lower regularization level than some relevant covariate. When $\rho_x > \rho_u$, 
the $\theta$ values decrease in $\sigma_u^2$. In this case, increasing measurement error variance 
makes the measurements less correlated than the true covariates, thus blurring 
out correlations between relevant and irrelevant covariates, and contributing to 
a lower probability of false positive selection. Of course, the measurement error 
also tends to make detection of the relevant covariates harder, by weakening 
the association between the outcome and the relevant covariates. On the other 
hand, when $\rho_x < \rho_u$, $\theta$ increases in $\sigma_u^2$, since the measurements typically will be 
more correlated than the true covariates, increasing the chance of false positive 
selections. 
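
The closed-form expression of Corollary 1 can be checked against a direct evaluation of the population IC-ME. The sketch below is only an illustration; the function names are hypothetical, and all signs of $\beta_1^0$ are taken to be positive, which is the configuration that the closed form corresponds to.

```python
import numpy as np

def equicov(p, var, rho):
    """p x p equicovariance matrix with variance var and correlation rho."""
    return var * ((1.0 - rho) * np.eye(p) + rho * np.ones((p, p)))

def theta_closed_form(l0, var_x, rho_x, var_u, rho_u):
    """Left-hand side of the inequality in Corollary 1."""
    c = var_x * rho_x + var_u * rho_u
    return c * l0 / (var_x + var_u + (l0 - 1) * c)

def theta_matrix(p, l0, var_x, rho_x, var_u, rho_u):
    """|| Sigma_WW^{2,1} (Sigma_WW^{1,1})^{-1} 1 ||_inf computed from the matrices."""
    Sww = equicov(p, var_x, rho_x) + equicov(p, var_u, rho_u)
    v = Sww[l0:, :l0] @ np.linalg.solve(Sww[:l0, :l0], np.ones(l0))
    return np.max(np.abs(v))

# sigma_x^2 = 1 as in Figure 1; the remaining values are illustrative.
low_noise = theta_closed_form(5, 1.0, 0.5, 0.5, 0.2)
high_noise = theta_closed_form(5, 1.0, 0.5, 1.0, 0.2)
```

With $\rho_x = 0.5 > \rho_u = 0.2$, `high_noise` is smaller than `low_noise`, in line with the discussion above: the derivative of the closed form with respect to $\sigma_u^2$ has the sign of $\rho_u - \rho_x$ when $\sigma_x^2 = 1$.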



Example 2: Block Structured Covariance Matrices 



Next, consider the case of block diagonal covariance matrices. This setup has 
previously been considered in the context of gene expression analysis. Groups 
of genes within a functional group or pathway can be assigned to a block, and 
assumed to be independent of the other blocks (Tai and Pan, 2007), as in the 
covariance matrix of the form 

$$\Sigma_{XX} = \begin{pmatrix} B_{XX,1} & 0 & \cdots & 0 \\ 0 & B_{XX,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_{XX,k} \end{pmatrix},$$

with a total of $k$ blocks. Without measurement error, the IC holds as long as it 
holds for each submatrix $B_{XX,l}$, $l = 1, \ldots, k$ (Zhao and Yu, 2006). Assume $\Sigma_{UU}$ 
has exactly the same structure as $\Sigma_{XX}$, with blocks $B_{UU,l}$. Here, 
$B_{XX,l}^{2,1} = (B_{XX,l}^{1,2})'$ denotes the covariance between the irrelevant covariates 
and the relevant covariates in block $l$, and $B_{XX,l}^{1,1}$ is the covariance between the 
relevant covariates in block $l$. Equivalent notation is used for $B_{UU,l}$. Finally, 
$\beta_{1,l}^0$ is the true coefficient vector of the non-zero covariates in block $l$. 

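Corollary 2. Assume $\Sigma_{XX}$ and $\Sigma_{UU}$ are block structured, as defined above. 
The IC-ME holds if there exists a constant $\theta \in [0, 1)$, such that 

$$\|(B_{XX,l}^{2,1} + B_{UU,l}^{2,1})(B_{XX,l}^{1,1} + B_{UU,l}^{1,1})^{-1}\mathrm{sign}(\beta_{1,l}^0)\|_\infty \le \theta, \quad \text{for } l = 1, \ldots, k.$$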

As an example, let $B_{XX,l}$ and $B_{UU,l}$ be positive equicovariance matrices, 
where $B_{XX,l}$ has diagonal $\sigma_{x,l}^2$ and off-diagonal $\sigma_{x,l}^2\rho_{x,l}$, and $B_{UU,l}$ has diagonal 
$\sigma_{u,l}^2$ and off-diagonal $\sigma_{u,l}^2\rho_{u,l}$, for $l = 1, \ldots, k$. It follows from Corollary 1 and 
Corollary 2 that the IC-ME holds if there exists a constant $\theta \in [0, 1)$, such that 

$$\frac{(\sigma_{x,l}^2\rho_{x,l} + \sigma_{u,l}^2\rho_{u,l})\, l_{0,l}}{\sigma_{x,l}^2 + \sigma_{u,l}^2 + (l_{0,l} - 1)(\sigma_{x,l}^2\rho_{x,l} + \sigma_{u,l}^2\rho_{u,l})} \le \theta, \quad \text{for } l = 1, \ldots, k, \qquad (11)$$

where $l_{0,l}$ is the number of relevant covariates in block $l$. Hence, every block must 
satisfy the IC-ME individually. If block $l$ has no relevant covariates, the 
left-hand side of (11) is zero for this particular element. Hence, only the correlation 
structure of blocks containing at least one relevant covariate is bounded by the 
IC-ME. This is intuitive, since the between-block correlation is zero, so a block 
without relevant covariates will typically have very small correlation with the 
outcome. For the blocks with relevant covariates, the discussion after Corollary 
1 applies here as well, together with the illustrations of Figure 1. 



4. Correction for Measurement Error in Lasso 

Sign consistent covariate selection with the naive Lasso requires that the MEC 
is satisfied, a much stronger condition than the IC, which is necessary in the 
absence of measurement error. Correction for measurement error is thus needed, 
and in this section we will consider a corrected Lasso, which yields sign consistent 
covariate selection under an IC-type condition. The correction we will use is 
motivated by the fact that the loss function of the naive Lasso is biased: 

$$E\left(\|y - W\beta\|_2^2 \mid X, y\right) = \|y - X\beta\|_2^2 + n\beta'\Sigma_{UU}\beta.$$

This suggests the definition of the Regularized Corrected Lasso (RCL), 

$$\hat{\beta}_{RCL} = \arg\min_{\|\beta\|_1 \le R} \left\{ \|y - W\beta\|_2^2 - n\beta'\Sigma_{UU}\beta + \lambda\|\beta\|_1 \right\}, \qquad (12)$$

introduced by Loh and Wainwright (2012). The loss function of the RCL is 
always non-convex when p > n, and its parameter space must be restricted 
to the $\ell_1$ ball $\|\beta\|_1 \le R$ with some finite radius $R$. A related problem is the 
Constrained Corrected Lasso (CCL), 

$$\hat{\beta}_{CCL} = \arg\min_{\|\beta\|_1 \le \kappa} \left\{ \|y - W\beta\|_2^2 - n\beta'\Sigma_{UU}\beta \right\}, \qquad (13)$$

where $\kappa$ is a constraint parameter, to be chosen by some model selection 
procedure. When the loss function is convex, RCL and CCL are equivalent by the 
strong duality theorem. Unless distinction is necessary, we will refer to both as 
the corrected Lasso. The same correction has been proposed for linear 
regression with the SCAD penalty (Liang and Li, 2009; Xu and Yu, 2007). Since the 
Lasso does not possess the oracle property of the SCAD (Fan and Li, 2001), the 
results of those papers do not immediately hold for the Lasso. The corrected 
Lasso has already been shown to yield good estimation bounds (Loh and 
Wainwright, 2012), and we will now study its capacity for sign consistent selection. 
We first define an IC for the corrected Lasso: 

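Definition 4. The Irrepresentable Condition for the Corrected Lasso (IC-CL) 
holds if the matrix $(S_{WW}^{1,1} - \Sigma_{UU}^{1,1})$ is invertible, and there exists a constant 
$\theta \in [0, 1)$ such that 

$$\|(S_{WW}^{2,1} - \Sigma_{UU}^{2,1})(S_{WW}^{1,1} - \Sigma_{UU}^{1,1})^{-1}\mathrm{sign}(\beta_1^0)\|_\infty \le \theta.$$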

When the empirical covariance matrices are replaced by population covari- 
ance matrices, the IC-CL reduces to the standard IC without measurement 
error. We now give finite sample conditions, under which the RCL performs 
sign consistent covariate selection. It is also clear that the covariate selection 
capacity of the Lasso with perfect measurements is recovered asymptotically. 
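
As a side check of the non-convexity of the corrected loss when p > n, the sketch below computes the smallest eigenvalue of $W'W - n\Sigma_{UU}$, which is the Hessian of $\|y - W\beta\|_2^2 - n\beta'\Sigma_{UU}\beta$ up to a factor 2. The dimensions and the choice $\Sigma_{UU} = 0.3 I$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 40
Sigma_uu = 0.3 * np.eye(p)

X = rng.normal(size=(n, p))
U = rng.multivariate_normal(np.zeros(p), Sigma_uu, size=n)
W = X + U

# Half the Hessian of the corrected loss ||y - W b||^2 - n b' Sigma_uu b.
H = W.T @ W - n * Sigma_uu
min_eig = np.linalg.eigvalsh(H).min()
```

Since $W'W$ has rank at most $n < p$, it has a non-trivial null space, and on that null space the quadratic form equals $-n\beta'\Sigma_{UU}\beta < 0$. With the isotropic $\Sigma_{UU}$ used here, the smallest eigenvalue is therefore $-0.3n = -6$, which is why the $\ell_1$-ball restriction is needed to keep the problem bounded.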

Theorem 3. Assume the IC-CL holds with constant $\theta$. Let $\hat{\beta}$ denote any local 
optimum of the RCL. Then 

$$P\left(\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta^0)\right) \ge P(A \cap B), \qquad (14)$$

for the events 

< - ^l(Sww - Siiu)-'s»W/3?)l)} (15) 
and 

^ = || ^(^ww ^ ^uu)(Sww ~ ^uu) ^ \7^) ^~ 

((^ww ^ ^uu)(^ww ^ ^uu) ^(^wu ^ ^uu) ~ (^Wl 

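Furthermore, 

$$P\left(\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta^0)\right) = 1 - o(\exp(-n^c)), \quad \text{for some } c \in [0, 1), \qquad (17)$$

if $\lambda_n/n \to 0$ and $\lambda_n/n^{(1+c)/2} \to \infty$ as $n \to \infty$. 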

Our analysis of the naive Lasso in Section 3.2 showed that sign consistent 
selection in that case required the very strict MEC (Definition 3). The RCL, 
on the other hand, performs sign consistent selection under the weaker IC-CL 
(Definition 4), which is very similar to the IC. The lower bound on the 
probability of sign consistent selection in the p > n finite sample case (14) 
applies to any local optimum of the RCL. Under a Lower Restricted Eigenvalue 
Condition (Loh and Wainwright, 2012, Theorem 1), the local optima obtained 
by a projected gradient descent algorithm (Agarwal et al., 2011; Duchi et 
al., 2008) are in a small neighborhood of the set of global optima. Thus, the 
estimation error $\|\hat{\beta} - \beta^0\|_1$ is already bounded, and when the event $(A \cap B)$ is 
satisfied, these local optima will also have the same sign pattern as $\beta^0$. 
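
The projected gradient approach referred to above can be sketched as follows for the CCL (13). This is a minimal illustration, not the authors' implementation: the sort-based $\ell_1$-ball projection follows the general scheme of Duchi et al. (2008), while the function names, the fixed step size (the inverse of a Lipschitz bound on the gradient), the iteration count and the simulated data are all assumptions.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {b : ||b||_1 <= radius}, radius > 0."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    thresh = (css[rho] - radius) / (rho + 1.0)   # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def corrected_lasso(y, W, Sigma_uu, kappa, step=None, n_iter=500):
    """Projected gradient descent for
    min ||y - W b||^2 - n b' Sigma_uu b  subject to  ||b||_1 <= kappa."""
    n, p = W.shape
    G = W.T @ W - n * Sigma_uu
    Wy = W.T @ y
    if step is None:
        # 2 * ||G||_2 is a Lipschitz constant of the gradient (assumed step rule).
        step = 1.0 / (2.0 * np.linalg.norm(G, 2))
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = 2.0 * (G @ b - Wy)
        b = project_l1_ball(b - step * grad, kappa)
    return b

# Illustrative data (assumed values), with Sigma_uu treated as known.
rng = np.random.default_rng(2)
n, p = 40, 80
beta0 = np.zeros(p)
beta0[:3] = 1.5
Sigma_uu = 0.2 * np.eye(p)
X = rng.normal(size=(n, p))
W = X + rng.multivariate_normal(np.zeros(p), Sigma_uu, size=n)
y = X @ beta0 + rng.normal(size=n)
b_ccl = corrected_lasso(y, W, Sigma_uu, kappa=4.5)
```

Because the objective is non-convex when p > n, the iterates converge at best to a local optimum; the constraint radius `kappa` would in practice be chosen by a model selection procedure such as cross validation.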

5. Correction for Measurement Error in Lasso for GLMs 

We now go beyond the linear model, and consider a Generalized Linear Model 
(GLM), for which $Y$ given $X$ has density 

$$f(y \mid x, \Theta) = \exp\left\{ \frac{y\eta - D(\eta)}{\phi} + c(y, \phi) \right\},$$

where $\eta = \mu + x'\beta$, $c(\cdot)$ and $D(\cdot)$ are functions, and $\Theta = (\mu, \beta, \phi)$ is the 
vector of unknown parameters, where $\mu$ is the intercept and $\phi$ is the dispersion 
parameter. In the classical case ($p \ll n$), and in the absence of measurement 
error, a consistent estimate is obtained by maximum likelihood estimation. 
When the covariates are subject to additive measurement error, unbiased score 
functions can be constructed using the method of Stefanski and Carroll (1987), 
yielding consistent estimators of $\Theta$. The method is reviewed by Carroll et al. 
(2006, Chapter 7), and we will follow their notation. Introducing 

$$\delta = w + y\Sigma_{UU}\beta/\phi,$$

the conditional density becomes 

$$f(y \mid \delta, \Theta, \Sigma_{UU}) = \exp\left\{ \frac{y\eta^* - D^*(\eta^*, \phi, \beta'\Sigma_{UU}\beta)}{\phi} + c^*(y, \phi, \beta'\Sigma_{UU}\beta) \right\}, \qquad (18)$$



y\ ^''^*'ys,j 1^0. (20) 



where $\eta^*$, $c^*(\cdot)$ and $D^*(\cdot)$ are modifications of the functions used in the absence 
of measurement error. Given an i.i.d. sample $(y, W)$, the loglikelihood becomes 

$$l(\Theta; y, W, \Sigma_{UU}) = \sum_{i=1}^{n} \log\left\{ f(y_i \mid \delta_i, \Theta, \Sigma_{UU}) \right\}, \qquad (19)$$

where 

A consistent estimator of $\Theta$ is now obtained by solving the unbiased estimating 
equation 

1 

(^0-(y,-4^P.)Vj|l?. 

The second line in (20), involving the estimation of $\phi$, is not feasible when 
p > n, so high-dimensional problems are ruled out when $\phi$ is unknown. However, 
there are important cases in which $\phi$ is known, including logistic and Poisson 
regression. We will thus continue, assuming $\phi$ is known or that an estimate 
exists. 

Corrected Lasso estimates for GLMs with measurement error in the p > n 
case can be obtained by solving the unbiased estimating equation (20), 
constraining the regression coefficients to the parameter space 

$$B(\kappa) = \{\beta : \|\beta\|_1 \le \kappa\}.$$

The algorithm proposed by Loh and Wainwright (2012) for linear models with
measurement error extends directly to GLMs. The intercept is not penalized,
so the iterates take the form

$$\mu^{s+1} = \mu^{s} + \alpha \sum_{i=1}^{n} \left( y_i - \frac{\partial \mathcal{D}_{*i}^{s}}{\partial \eta_*} \right), \tag{21}$$

$$\beta^{s+1} = \Pi_{B(\kappa)}\left( \beta^{s} + \alpha \sum_{i=1}^{n} \left( y_i - \frac{\partial \mathcal{D}_{*i}^{s}}{\partial \eta_*} \right) \delta_i^{s} \right), \tag{22}$$

for $s = 1, 2, \ldots$, until convergence. $\Pi_{B(\kappa)}(\cdot)$ denotes projection onto $B(\kappa)$, $\alpha$ is
the stepsize, $\delta_i^{s} = w_i + y_i\Sigma_{uu}\beta^{s}/\phi$, and $\mathcal{D}_{*i}^{s}$ is the value of the function $\mathcal{D}_*(\cdot)$ for subject $i$ at iteration
$s$. An efficient projection algorithm is proposed by Duchi et al. (2008). If covariate
selection is the primary goal, and a perturbation method like stability
selection (Meinshausen and Bühlmann, 2010) is used to estimate the active set,
the loglikelihood (19) is not required. An estimate $\hat{L}$ is obtained directly from
the iteration scheme (21)-(22), using the conditional score function of the model
considered, together with an initial estimate $\beta^{s=0}$. On the other hand, if the
optimal constraint level is estimated by cross-validation, the loglikelihood (19)
is used to compare models at different values of $\kappa$. In our simulations, using the
vector of zeros as a starting point yielded fast convergence of the method.
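
The projection step in the iteration scheme can be sketched as follows. This is a minimal illustration of the sort-based Euclidean projection onto an $\ell_1$-ball described by Duchi et al. (2008), not the code used for the simulations in this paper; the function name `project_l1_ball` is hypothetical, and the sketch uses only the Python standard library.

```python
def project_l1_ball(v, radius):
    """Euclidean projection of the vector v onto {b : ||b||_1 <= radius}.

    Sort-based variant: find a soft-threshold level theta such that the
    soft-thresholded vector has l1-norm exactly equal to radius.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    abs_v = [abs(x) for x in v]
    if sum(abs_v) <= radius:
        return list(v)  # already inside the ball, nothing to do

    u = sorted(abs_v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumsum += u_j
        t = (cumsum - radius) / j
        if u_j > t:
            theta = t  # the last j satisfying the condition fixes theta

    # Soft-threshold each coordinate, keeping its sign.
    return [(1.0 if x > 0 else -1.0) * max(abs(x) - theta, 0.0) for x in v]
```

For example, `project_l1_ball([3.0, 1.0], 2.0)` shrinks both coordinates by the same threshold and sets the smaller one to zero, which is how the constraint induces sparsity in the iterates.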

Conditional score functions for GLMs with measurement error can be derived
directly (Carroll et al., 2006, Chapter 7). We will here consider logistic and
Poisson regression, for which $\phi = 1$ and corrected Lasso estimates are easily
obtained.



5.1. Logistic Regression 

Logistic regression is important in high-dimensional classification problems. Examples
include detection of differentially expressed genetic markers in case/control
studies (Ayers and Cordell, 2010; Wu et al., 2009) and discrimination between
cognitive conditions using fMRI data (Ryali et al., 2010). A detailed analysis of
estimation bounds, covariate screening and covariate selection with $\ell_1$-penalized
logistic regression in the absence of measurement error is given by Bunea (2008).
Here, we consider binomial logistic regression, with response $y_i \sim B(1, H(\eta_i))$,
$i = 1, \ldots, n$, where $H(\eta) = \{1 + \exp(-\eta)\}^{-1}$ is the logistic (inverse logit) function and $B(\cdot)$
denotes the binomial distribution. When the covariates are subject to additive
measurement error, the terms in the modified density (18) are:

$$\eta_* = \mu + \beta^{\top}(w + y\Sigma_{uu}\beta)$$

$$c_*(y, \beta^{\top}\Sigma_{uu}\beta) = -(y^2/2)\,\beta^{\top}\Sigma_{uu}\beta$$

$$\mathcal{D}_*(\eta_*, \beta^{\top}\Sigma_{uu}\beta) = \log\left\{ 1 + \exp\left( \eta_* - (1/2)\beta^{\top}\Sigma_{uu}\beta \right) \right\},$$

and

$$\frac{\partial \mathcal{D}_*}{\partial \eta_*} = H\left\{ \eta_* - (1/2)\beta^{\top}\Sigma_{uu}\beta \right\}.$$

Hence, the iteration scheme (21)-(22) becomes

$$\mu^{s+1} = \mu^{s} + \alpha \sum_{i=1}^{n} \left\{ y_i - H\left( \mu^{s} + (\beta^{s})^{\top}w_i + (y_i - 1/2)(\beta^{s})^{\top}\Sigma_{uu}\beta^{s} \right) \right\},$$

$$\beta^{s+1} = \Pi_{B(\kappa)}\left( \beta^{s} + \alpha \sum_{i=1}^{n} \left\{ y_i - H\left( \mu^{s} + (\beta^{s})^{\top}w_i + (y_i - 1/2)(\beta^{s})^{\top}\Sigma_{uu}\beta^{s} \right) \right\} (w_i + y_i\Sigma_{uu}\beta^{s}) \right),$$

for $s = 1, 2, \ldots$ until convergence. Simulation results with logistic regression are
shown in Section 6.3.
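
As an illustration of the logistic iteration, the following sketch computes one gradient step before the projection onto the constraint set. It is a minimal reading of the update under the assumption that the residual for subject $i$ is $y_i$ minus the logistic function evaluated at $\mu + \beta^{\top}w_i + (y_i - 1/2)\beta^{\top}\Sigma_{uu}\beta$, and that it is multiplied by $w_i + y_i\Sigma_{uu}\beta$. The function names are hypothetical, and the projection of the returned coefficient vector still has to be applied separately.

```python
import math


def logistic(t):
    """Numerically stable logistic function H(t) = 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def corrected_logistic_step(mu, beta, W, y, sigma_uu, alpha):
    """One un-projected step of the corrected logistic iteration.

    W is a list of n measurement rows, y a list of 0/1 responses,
    sigma_uu the (assumed known) measurement error covariance matrix,
    and alpha the stepsize.
    """
    p = len(beta)
    # Sigma_uu * beta and the quadratic form beta' Sigma_uu beta.
    sb = [sum(sigma_uu[j][k] * beta[k] for k in range(p)) for j in range(p)]
    q = sum(beta[j] * sb[j] for j in range(p))

    d_mu = 0.0
    d_beta = [0.0] * p
    for w_i, y_i in zip(W, y):
        lin = mu + sum(b * w for b, w in zip(beta, w_i)) + (y_i - 0.5) * q
        resid = y_i - logistic(lin)
        d_mu += resid
        for j in range(p):
            d_beta[j] += resid * (w_i[j] + y_i * sb[j])

    new_mu = mu + alpha * d_mu
    new_beta = [beta[j] + alpha * d_beta[j] for j in range(p)]
    return new_mu, new_beta  # new_beta must still be projected onto B(kappa)
```

Starting from the vector of zeros, the quadratic form vanishes and the first step coincides with an ordinary logistic regression gradient step on the measured covariates.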






5.2. Poisson Regression 

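Poisson regression is used when the outcome can be modeled by a Poisson
process, $y_i \sim \mathrm{Pois}(e^{\eta_i})$, $i = 1, \ldots, n$. An example with high-dimensional data
is given by Huang et al. (2010), who define the spatial Lasso, which is applied
with a Poisson regression model to study the distribution of tree species in a
geographic area. In the case of additive measurement error, the terms in the
modified density (18) are:

$$\eta_* = \mu + \beta^{\top}(w + y\Sigma_{uu}\beta)$$

$$c_*(y, \beta^{\top}\Sigma_{uu}\beta) = -\log(y!) - (y^2/2)\,\beta^{\top}\Sigma_{uu}\beta$$

$$\mathcal{D}_*(\eta_*, \beta^{\top}\Sigma_{uu}\beta) = \log\left\{ \sum_{z=0}^{\infty} (z!)^{-1} \exp\left\{ z\eta_* - (z^2/2)\beta^{\top}\Sigma_{uu}\beta \right\} \right\}$$

and

$$\frac{\partial \mathcal{D}_*}{\partial \eta_*} = \frac{\sum_{z=0}^{\infty} z\,(z!)^{-1} \exp\left\{ z\eta_* - (z^2/2)\beta^{\top}\Sigma_{uu}\beta \right\}}{\sum_{z=0}^{\infty} (z!)^{-1} \exp\left\{ z\eta_* - (z^2/2)\beta^{\top}\Sigma_{uu}\beta \right\}}.$$

Hence, the iteration scheme (21)-(22) for Poisson regression involves numerical
approximation of the infinite sums in $\partial \mathcal{D}_*/\partial \eta_*$, but is otherwise straightforward.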
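
One way to approximate the infinite sums is to truncate them and work on the log scale. The sketch below is an illustration only: it assumes the ratio has terms proportional to $(z!)^{-1}\exp\{z\eta_* - (z^2/2)\beta^{\top}\Sigma_{uu}\beta\}$, with an extra factor $z$ in the numerator. The function name and the truncation point `z_max` are hypothetical choices, and `z_max` must be large relative to $\exp(\eta_*)$ for the truncation to be accurate.

```python
import math


def poisson_conditional_mean(eta_star, q, z_max=200):
    """Truncated-sum approximation of dD*/d(eta*) for Poisson regression.

    eta_star : modified linear predictor
    q        : the quadratic form beta' Sigma_uu beta (q >= 0)
    z_max    : truncation point of the infinite sums (an assumption)
    """
    # log of each term: z*eta - z^2 q / 2 - log(z!)
    log_terms = [
        z * eta_star - 0.5 * z * z * q - math.lgamma(z + 1)
        for z in range(z_max + 1)
    ]
    m = max(log_terms)  # subtract the maximum to avoid overflow
    weights = [math.exp(t - m) for t in log_terms]
    return sum(z * w for z, w in enumerate(weights)) / sum(weights)
```

With `q = 0` the weights are proportional to Poisson probabilities, so the function returns approximately $\exp(\eta_*)$, the usual Poisson mean; a positive `q` downweights large counts and gives a smaller value.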

6. Simulations 

6.1. MEC in the Large Sample Case 

Under the MEC (Definition 3), the naive Lasso is sign consistent for covariate 
selection in the large sample limit. Proposition 1 also suggests that the naive 
Lasso will not be sign consistent, when the MEC is not close to being satisfied. 
On the other hand, the corrected Lasso is sign consistent under the IC-CL 
(Definition 4). We designed numerical experiments to investigate how sharp 
a bound the MEC is, by setting up four cases with very similar covariance 
structures. In particular, we define: 

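• Case I: $\Sigma_{xx}^{1,1}$ and $\Sigma_{xx}^{2,2}$ are equicovariance matrices, with diagonal $\sigma_x^2 = 1$
and off-diagonal $\rho_x\sigma_x^2 = (0.4)\sigma_x^2$. $\Sigma_{uu}^{1,1}$ and $\Sigma_{uu}^{2,2}$ are also equicovariance
matrices, with diagonal $\sigma_u^2 = 0.2$ and off-diagonal $\rho_u\sigma_u^2 = (0.4)\sigma_u^2$.
Finally, $\Sigma_{xx}^{1,2} = (\Sigma_{xx}^{2,1})^{\top} = 0$ and $\Sigma_{uu}^{1,2} = (\Sigma_{uu}^{2,1})^{\top} = 0$.

• Case II: $\Sigma_{xx}$ is an equicovariance matrix with diagonal $\sigma_x^2 = 1$ and
off-diagonal $\rho_x\sigma_x^2 = (0.4)\sigma_x^2$, and $\Sigma_{uu} = \sigma_u^2\Sigma_{xx}$ with $\sigma_u^2 = 0.2$.

• Case III: As in Case II, but with $\Sigma_{uu}^{1,2} = (\Sigma_{uu}^{2,1})^{\top} = 0$.

• Case IV: As in Case II, but with $\Sigma_{xx}^{1,2} = (\Sigma_{xx}^{2,1})^{\top} = 0$.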

In Case I, the MEC is satisfied because the relevant and the irrelevant covariates
are uncorrelated, both in true covariates and in measurement error. In
Case II, the MEC is also satisfied, since $\Sigma_{uu}$ equals a constant times $\Sigma_{xx}$ (this
also holds in Case I). In Case III and Case IV, on the other hand, the MEC is not
satisfied. The irrepresentable conditions IC-ME and IC-CL hold in all cases.
Thus, the theory predicts that the naive Lasso, in the large sample limit, will be
a sign consistent selector in Case I and II, whereas the corrected Lasso will be
sign consistent in all four cases.

Fig 2. Covariate selection with large n of the naive Lasso (red) and the corrected Lasso
(blue), cf. Section 6.1. (Panel titles: Case I, Case II; horizontal axis: FPR.)

In order to investigate the asymptotic dependence on the MEC, simulations
were run with $n = 1000$, $p = 35$ and $l_0 = 10$. The coefficient vector was set to

$$\beta^0 = (7, 4, 2, 1, 1, 7, 4, 2, 1, 1, 0, \ldots, 0)^{\top},$$

and the standard deviation of the model error was $\sigma = 0.5$. A total of 10
design matrices were randomly drawn from $N(0, \Sigma_{xx})$, whereupon 5 responses
were generated. Finally, 10 measurements were generated for each $(X, y)$ pair,
yielding a total of 500 Monte Carlo samples for each set of parameters. $W$ was
scaled and centered, and $y$ had its mean subtracted. The whole regularization
paths of both the naive Lasso and the corrected Lasso were computed for each
set of $(W, y)$.

The simulation results are shown in Figure 2, in which the blue curve rep- 
resents the corrected Lasso and the red curve the naive Lasso, for a range of 
regularization parameters. The Lasso solution based on the true covariates X 
(not shown) was also computed, and was close to optimal in all four cases. The 
vertical axis shows the sensitivity of the estimates $\hat\beta$, defined as the fraction
of $j \in L_0$ for which $\mathrm{sign}(\hat\beta_j) = \mathrm{sign}(\beta_j^0)$. The horizontal axis shows the false
positive rate (FPR), defined as the fraction of $j \in L_0^c$ for which $\mathrm{sign}(\hat\beta_j) \ne 0$.
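
The two performance measures can be computed as in the following sketch, which takes the relevant set to be the nonzero entries of the true coefficient vector. The function names are hypothetical and the code is an illustration of the definitions, not the code used to produce the figure.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def selection_metrics(beta_hat, beta_true):
    """Sensitivity and false positive rate of a coefficient estimate.

    Sensitivity: fraction of relevant covariates estimated with the
    correct sign. FPR: fraction of irrelevant covariates with a nonzero
    estimate.
    """
    relevant = [j for j, b in enumerate(beta_true) if b != 0]
    irrelevant = [j for j, b in enumerate(beta_true) if b == 0]
    hits = sum(sign(beta_hat[j]) == sign(beta_true[j]) for j in relevant)
    false_pos = sum(sign(beta_hat[j]) != 0 for j in irrelevant)
    sensitivity = hits / len(relevant) if relevant else float("nan")
    fpr = false_pos / len(irrelevant) if irrelevant else float("nan")
    return sensitivity, fpr
```

Evaluating `selection_metrics` at each point of a regularization path and plotting sensitivity against FPR gives one curve of the kind described above.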
It is clear from the two upper plots that the naive and the corrected Lasso
are equally good at covariate selection in Case I and Case II, as predicted by
Theorem 2 and Theorem 3. The lower left plot shows that the corrected Lasso
outperforms the naive Lasso in Case III, suggesting that the MEC may be crucial
for covariate selection. In Case IV, the corrected version is slightly better
than the naive. Case II and Case III differ in that the measurement errors of
the relevant and the irrelevant covariates are uncorrelated in the latter, and
this slight modification of $\Sigma_{uu}$ yields strikingly worse covariate selection by
the naive Lasso. We also note that the overall performance of both methods
is better in Case I and Case IV, in which the true values of the relevant and
the irrelevant covariates are uncorrelated. In these cases, the IC-ME and the
IC-CL for the population covariance matrices hold with a smaller constant $\theta$,
thereby increasing the probability of sign consistent covariate selection for each
random sample. On the other hand, in Case II and Case III, the true values of
the relevant and the irrelevant covariates are correlated, yielding a larger $\theta$ in
the IC-ME and IC-CL, making covariate selection harder.

6.2. High-Dimensional Linear Model 

The goal of our next example is to assess the statistical performance of the
corrected Lasso in a more typical setting. In practice, the regularization or constraint
parameter is selected by cross-validation. Hence, for a high-dimensional
sample, we compare the cross-validated solutions of the corrected Lasso and the
naive Lasso. In particular, we let $n = 100$, $p = 200$ and $l_0 = 5$. The vector of
true regression coefficients was

$$\beta^0 = (7, 4, 2, 1, 1, 0, \ldots, 0)^{\top},$$

identical to the $\beta^0$ used by Zhao and Yu (2006, Section 3.3), except for a different
number of zeros. The true covariates were generated from $N(0, \Sigma_{xx})$, with $\Sigma_{xx}$
an equicovariance matrix with diagonal $\sigma_x^2 = 1$ and off-diagonal $\rho_x\sigma_x^2 = 0.2$.
The variance of the model error was set to $\sigma = 0.4$. The measurement error
added to the true covariates was generated from $N(0, \Sigma_{uu})$, with $\Sigma_{uu}$ an
equicovariance matrix with diagonal $\sigma_u^2$ and off-diagonal $\rho_u\sigma_u^2$. Four different
combinations of $\sigma_u^2$ and $\rho_u$ were considered, as shown in Table 1.

For each pair of measurement error parameters $(\sigma_u^2, \rho_u)$, 100 true design
matrices $X$ were generated. For each of these, a response $y$ and a measurement
$W = X + U$ were generated. The response had its mean subtracted, and the
measurement matrix was scaled and centered. The naive Lasso estimate $\hat\beta_{\mathrm{naive}}$
was computed with the R package glmnet, using ten-fold cross-validation. A grid
of 100 $\lambda$ values was considered. As recommended by Hastie et al. (2009, p. 244),
the largest regularization within one standard error of the minimum of the cross-validation
curve was chosen (lambda.1se in glmnet). The Lasso estimate $\hat\beta_{\mathrm{perf}}$
for the case of perfect measurements was also computed in each case, using the
same procedure, with $X$ instead of $W$. Next, the corrected Lasso estimate $\hat\beta_{\mathrm{corr}}$
was computed. The constrained version of the corrected Lasso (equation (13))
was used, to avoid dealing with multiple optimization parameters. The candidate
grid of constraint parameters k consisted of 100 equally spaced numbers in the 
range [(10~'^)|j,9°||i, (1.5)|l/3''||i]. The loss function of the corrected Lasso was 
used to estimate the prediction error. For each sample, $\kappa$ was chosen as the
smallest constraint within one standard error of the minimum of the cross-validation
curve. In practice, the exact grid of regularization parameters is not
crucial. When $\beta^0$ is not available, a good solution may be to use 100 equally
spaced constraint parameters, e.g., in the range [(10~^)||/3naivellii 3|j/9naivolli]- 
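
The selection rule for the constraint level can be sketched as follows, assuming that a cross-validation curve (mean error and standard error at each candidate constraint) has already been computed. The function name `one_se_choice` and its argument names are hypothetical, and the code only illustrates the one-standard-error rule as described above.

```python
def one_se_choice(kappas, cv_mean, cv_se):
    """Smallest constraint whose CV error is within one standard error
    of the minimum of the cross-validation curve.

    kappas  : candidate constraint levels
    cv_mean : mean cross-validation error at each candidate
    cv_se   : standard error of the CV error at each candidate
    """
    i_min = min(range(len(cv_mean)), key=cv_mean.__getitem__)
    threshold = cv_mean[i_min] + cv_se[i_min]
    eligible = [k for k, m in zip(kappas, cv_mean) if m <= threshold]
    return min(eligible)
```

A smaller constraint corresponds to stronger regularization, so this rule is the constrained-form analogue of choosing the largest penalty within one standard error of the minimum.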

Table 1 shows the model selection performance and estimation error of the
cross-validated estimates $\hat\beta_{\mathrm{corr}}$, $\hat\beta_{\mathrm{naive}}$ and $\hat\beta_{\mathrm{perf}}$. The leftmost column shows
the entries of $\Sigma_{uu}$. The column heading # FP denotes the number of false
positives, and # TP the number of true positives. The two rightmost columns
show the estimation error, as measured in the $\ell_1$- and the $\ell_2$-norm. The numbers
in each cell show the average over 100 simulations, and the numbers in
parentheses show the sample standard deviations.

In all cases considered in Table 1, the corrected Lasso is much better than the
naive Lasso in avoiding false positives. This is supported by the analysis of Section
3.2: When the measurements are correlated, the naive Lasso easily picks up
irrelevant covariates. On the other hand, the naive Lasso performs slightly better
at selecting the five truly relevant covariates. However, in high-dimensional problems,
when the sparsity assumption holds, controlling the number of false positives
can be more important (e.g., Fan et al. (2009); Meinshausen and Bühlmann
(2010)). The estimation error of the corrected Lasso is also smaller than that
of the naive Lasso, with one exception, although they are both far from the
estimation error of the Lasso with perfect measurements.

A possible explanation of the strong tendency of the corrected Lasso to select
fewer covariates than the naive Lasso is the increased variance which is ubiquitous
in measurement error corrections (Carroll et al., 2006, Section 3.5.1). The
mean-squared error (MSE) of the corrected coefficient estimates may thus be 
much larger than the MSE of the naive estimates, yielding a smaller reduction 
in the cross-validation error by inclusion of an additional covariate. It should 
be noted that the corrected Lasso also performs better than the perfect Lasso 
in avoiding false positives. It is well known that the Lasso selects too many 
covariates, and when the corrected Lasso avoids this, it probably relates to the 
same problem as described above. 

Parameters            Method   # FP           # TP          $\|\hat\beta - \beta^0\|_1$   $\|\hat\beta - \beta^0\|_2$

$\sigma_u^2 = 0.1$    Corr.    1.90 (2.62)    4.12 (0.78)   4.36 (1.14)   2.01 (0.53)
$\rho_u = 0.0$        Naive    12.04 (4.42)   4.78 (0.42)   5.49 (1.37)   1.94 (0.43)
                      Perf.    6.10 (3.55)    5.00 (0.00)   0.62 (0.16)   0.25 (0.05)

$\sigma_u^2 = 0.3$    Corr.    0.83 (1.18)    2.82 (0.80)   6.00 (1.40)   2.90 (0.68)
$\rho_u = 0.0$        Naive    15.43 (5.54)   4.20 (0.77)   9.63 (1.95)   3.28 (0.56)
                      Perf.    6.92 (3.12)    5.00 (0.00)   0.67 (0.16)   0.26 (0.05)

$\sigma_u^2 = 0.1$    Corr.    1.77 (2.07)    3.98 (0.78)   4.71 (1.04)   2.17 (0.45)
$\rho_u = 0.3$        Naive    5.50 (4.59)    4.29 (0.67)   5.38 (1.29)   2.24 (0.42)
                      Perf.    6.37 (3.66)    5.00 (0.00)   0.62 (0.15)   0.25 (0.04)

$\sigma_u^2 = 0.3$    Corr.    1.46 (1.90)    2.96 (0.75)   6.11 (1.53)   2.84 (0.67)
$\rho_u = 0.3$        Naive    3.56 (4.53)    3.15 (0.69)   8.40 (1.86)   3.69 (0.59)
                      Perf.    6.56 (3.62)    5.00 (0.00)   0.64 (0.17)   0.26 (0.05)

Table 1

Performance of the cross-validated solution of the corrected Lasso (Corr.), naive Lasso
(Naive) and the Lasso with perfect measurements (Perf.), in terms of number of false
positives out of 195 (# FP), number of true positives out of 5 (# TP), and estimation
error in the $\ell_1$- and $\ell_2$-norm. See Section 6.2.

6.3. High-Dimensional Logistic Regression

We now study correction for measurement error in logistic regression with Lasso
penalty, introduced in Section 5.1. Simulations were run with $n = 100$, $p = 200$
and $l_0 = 4$, where $\beta_1^0 = (7, -4, 5, 5)^{\top}$. A total of 20 true design matrices $X$
were generated, with independent and identically normally distributed rows.
$\Sigma_{xx}$ was an equicovariance matrix, with diagonal $\sigma_x^2 = 1$ and off-diagonal
$\rho_x\sigma_x^2 = 0.2$. For each $X$, one response was generated from $B(1, H(X\beta^0))$,
and five measurements $W = X + U$ were generated for each $(X, y)$ pair, with
$U \sim N(0, \Sigma_{uu})$. The measurement matrix $W$ was scaled and centered. This
gives a total of 100 Monte Carlo samples for each of the two cases considered:

• Case I: $\Sigma_{uu}$ an equicovariance matrix with diagonal entries $\sigma_u^2 = 0.2$ and
off-diagonal entries $\rho_u\sigma_u^2 = (0.5)\sigma_u^2$.

• Case II: $\Sigma_{uu}$ a Toeplitz structure, with $(\Sigma_{uu})_{j,k} = 0.4^{|j-k|+1}$. Hence,
it has variance 0.4 on the diagonal, and decaying correlations. This can
be realistic in microarray experiments, for which covariates which are spatially
close may have correlated measurement errors. The indices of the
relevant covariates were randomly drawn from $\{1, \ldots, p\}$ to avoid the unrealistic
situation where the measurement error of all relevant covariates
is strongly correlated.

Fig 3. Covariate selection for logistic regression with p > n for the naive Lasso (red), corrected
Lasso (blue) and the Lasso with perfect measurements (green), cf. Section 6.3.
(Panel titles: Logistic Regression, Case i; Logistic Regression, Case ii.)
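
The two measurement error covariance structures can be built as in the sketch below. It assumes an equicovariance matrix has a common variance on the diagonal and a common covariance off the diagonal, and that the Toeplitz structure has entries decaying geometrically in $|j - k|$ with the variance itself as the base, so that the diagonal equals 0.4 as stated above. Both function names are hypothetical.

```python
def equicovariance(p, variance, rho):
    """p x p matrix with `variance` on the diagonal and rho * variance elsewhere."""
    return [
        [variance if j == k else rho * variance for k in range(p)]
        for j in range(p)
    ]


def toeplitz_decay(p, base):
    """p x p Toeplitz matrix with entry (j, k) equal to base ** (|j - k| + 1)."""
    return [[base ** (abs(j - k) + 1) for k in range(p)] for j in range(p)]


# Example: a small 4 x 4 instance of each structure.
sigma_equi = equicovariance(4, 0.2, 0.5)
sigma_toep = toeplitz_decay(4, 0.4)
```

In the Toeplitz case the covariance between neighbouring covariates is `0.4 ** 2` and it shrinks by a further factor of 0.4 for each additional step of separation.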

Figure 3 shows the covariate selection performance of the corrected Lasso (blue 
line) , the naive Lasso (red line) and the Lasso with perfect measurements (green 
line), along the whole regularization path. The left end of the horizontal axis 
corresponds to a constraint k = 10^"'||/3"|| i and the right end to k = 10~^||,9°||i. 
The corrected Lasso performs slightly better than the naive approach in both 
Case I and Case II. 

7. Discussion 

In this paper, we have shown how linear regression with the Lasso is affected 
by additive measurement error. Our analytical results describe how bounds on
prediction and estimation error increase with the measurement error. For covariate
screening, the dependence of the beta-min condition on the measurement error 
is also derived. Furthermore, our analysis shows that naive use of the Lasso when 
the covariates are subject to measurement error, may yield very poor covariate 
selection, particularly when the measurement errors have strong correlations. A 
simple correction method was considered, studied earlier by Loh and Wainwright 
(2012). Our finite sample results show conditions under which this corrected 
Lasso will be a sign consistent covariate selector. Asymptotically, sign consistent 
covariate selection with the corrected Lasso requires conditions very similar 
to the Lasso in the absence of measurement error (IC-CL, Definition 4). In 
contrast, asymptotically sign consistent covariate selection with the naive Lasso 
essentially requires the relevant and the irrelevant covariates to be uncorrelated 
(MEC, Definition 3). We have also pointed out methods of correction for additive 
measurement error in GLMs with the Lasso penalty. Using the iteration scheme 
suggested by Loh and Wainwright (2012) for linear models, corrected Lasso 
estimates for GLMs are computed efficiently even when $p \gg n$.

Throughout the paper, the covariance matrix of the measurement errors has 
been assumed known. In practice, however, it has to be estimated from repeated 
measurements. An example from biomass characterization is given by Biagioni 
et al. (2012), where repeated measurements of p = 421 covariates on n = 759 
study units were used to estimate the measurement error and improve covariate 
selection. For microarray data, the Bayesian Gene Expression (BGX) method of 
Hein et al. (2005) yields posterior distributions of the expression of each gene,
whose variance and covariance can be used to estimate the measurement error.
Our analysis suggests that taking measurement error into account may improve
covariate selection, particularly by reducing the number of false positives. However,
in our simulations in the high-dimensional settings, the corrected Lasso
does not seem to perform too well. When the covariance matrix of the measurement
errors is estimated, the performance will deteriorate further. We should
keep in mind, though, that our simulations in these situations were based upon
a sample size of 100, which is quite low. For larger sample sizes, we expect the
correction to perform better.

Appendix A: Proofs and Technical Details 
A.1. Regularity Conditions

We first note a convergence property of the sample covariance matrices, which
will be useful in the large sample results. Subscripts $n$ will be used when large
sample limits ($n \to \infty$) are considered, but are otherwise omitted.

The true covariates $(x_1, \ldots, x_n)$ and the measurement errors $(u_1, \ldots, u_n)$
are assumed i.i.d. according to $N(0, \Sigma_{xx})$ and $N(0, \Sigma_{uu})$, respectively, where
all elements of the covariance matrices are finite. It follows (Anderson, 2003,
Theorem 3.4.4) that the limiting distribution of $\sqrt{n}(S_{xx,n} - \Sigma_{xx})$ is normal
with mean 0 and covariances $\sigma_x^{ik}\sigma_x^{jl} + \sigma_x^{il}\sigma_x^{jk}$, where $\sigma_x^{ik}$ is element $(i, k)$ of $\Sigma_{xx}$
and $i, j, k, l \in \{1, \ldots, p\}$. The same result applies to the limiting distribution of
$\sqrt{n}(S_{uu,n} - \Sigma_{uu})$ and $\sqrt{n}(S_{ww,n} - \Sigma_{ww})$, by replacing the subscript $x$ with
$u$ or $w$, respectively. Now, the deterministic convergences

$$S_{xx,n} \to \Sigma_{xx}, \quad \text{as } n \to \infty \tag{23}$$

and

$$\frac{1}{n} \max_{1 \le i \le n} \left( (x_i^n)^{\top} x_i^n \right) \to 0, \quad \text{as } n \to \infty, \tag{24}$$

hold with probability 1 (Zhao and Yu, 2006, Section 2.1). These convergences
also apply to $U$ and $W$, by replacing the subscripts. Regularity conditions
(23) and (24) have also been assumed by, e.g., Knight and Fu (2000) and Zhao
and Yu (2006).



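A.2. KKT Conditions

We introduce the new coefficient $\gamma = \beta - \beta^0$, which yields the naive Lasso (3)
on the form

$$\hat\gamma = \underset{\gamma}{\mathrm{argmin}} \left\{ -\frac{2}{n}\epsilon^{\top}W\gamma + \gamma^{\top}S_{ww}\gamma + 2\gamma^{\top}S_{wu}\beta^0 + \frac{\lambda}{n}\|\gamma + \beta^0\|_1 \right\}, \tag{25}$$

where we have removed all terms which are constant in $\gamma$. Taking derivatives,
we arrive at the KKT conditions for the naive Lasso:

Lemma 2. $\hat\gamma = \hat\beta - \beta^0$ is a solution to (25) if and only if

$$-\frac{2}{n}W^{\top}\epsilon + 2S_{ww}\hat\gamma + 2S_{wu}\beta^0 = -\frac{\lambda}{n}\tau,$$

where $\tau$ is a vector such that $\|\tau\|_\infty \le 1$ and $\tau_j = \mathrm{sign}(\hat\beta_j)$ for $j$ such that
$\hat\beta_j \ne 0$.

The same change of variables for the corrected Lasso (12) yields

$$\hat\gamma_{CL} = \underset{\|\gamma + \beta^0\|_1 \le R}{\mathrm{argmin}} \left\{ -\frac{2}{n}\epsilon^{\top}W\gamma + \gamma^{\top}(S_{ww} - \Sigma_{uu})\gamma + 2\gamma^{\top}(S_{wu} - \Sigma_{uu})\beta^0 + \frac{\lambda}{n}\|\gamma + \beta^0\|_1 \right\}. \tag{26}$$

Again taking derivatives, we arrive at the KKT conditions.

Lemma 3. $\hat\gamma = \hat\beta - \beta^0$ is a critical point of the corrected Lasso if $\|\hat\gamma + \beta^0\|_1 \le R$
and

$$-\frac{2}{n}W^{\top}\epsilon + 2(S_{ww} - \Sigma_{uu})\hat\gamma + 2(S_{wu} - \Sigma_{uu})\beta^0 = -\frac{\lambda}{n}\tau,$$

where $\tau$ is as defined in Lemma 2.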
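A.3. Proof of Lemma 1

This proof goes along the lines of the proof of Theorem 1 in Knight and Fu
(2000), but with the addition of measurement error. We start with the naive
Lasso after reparametrization (25), and denote its Lagrange function by

$$\mathcal{L}_n(\gamma) = -\frac{2}{n}\epsilon_n^{\top}W_n\gamma + \gamma^{\top}S_{ww,n}\gamma + 2\gamma^{\top}S_{wu,n}\beta^0 + \frac{\lambda_n}{n}\|\gamma + \beta^0\|_1. \tag{27}$$

Note that

$$\frac{2}{n}W_n^{\top}\epsilon_n \xrightarrow{d} N\left(0, (4/n)\sigma^2\Sigma_{ww}\right).$$

That is, the first term in (27) converges in distribution to a normally distributed
quantity whose variance goes to zero as $1/n$, which is equivalent to convergence
in probability to zero. Combining this result with the assumption that $\lambda_n/n \to 0$
as $n \to \infty$ yields

$$\mathcal{L}_n(\gamma) \xrightarrow{p} \mathcal{L}(\gamma) = \gamma^{\top}\Sigma_{ww}\gamma + 2\gamma^{\top}\Sigma_{uu}\beta^0.$$

Since $\mathcal{L}_n(\gamma)$ is convex, it follows that (Knight and Fu, 2000)

$$\mathrm{argmin}\{\mathcal{L}_n(\gamma)\} \xrightarrow{p} \mathrm{argmin}\{\mathcal{L}(\gamma)\}.$$

The minimum of $\mathcal{L}(\gamma)$ is easily found by setting its gradient $2\Sigma_{ww}\gamma + 2\Sigma_{uu}\beta^0$ to zero, and accordingly

$$\hat\gamma \xrightarrow{p} -\Sigma_{ww}^{-1}\Sigma_{uu}\beta^0.$$

The result follows immediately.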
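
The limit can be evaluated numerically for given population covariance matrices. Minimizing the limiting function above gives $\gamma = -\Sigma_{ww}^{-1}\Sigma_{uu}\beta^0$, so the naive estimate tends to $\beta^0 - \Sigma_{ww}^{-1}\Sigma_{uu}\beta^0$. The sketch below additionally assumes $\Sigma_{ww} = \Sigma_{xx} + \Sigma_{uu}$ (measurement error independent of the true covariates), in which case the limit equals $\Sigma_{ww}^{-1}\Sigma_{xx}\beta^0$. The function names are hypothetical and the linear solver is a plain Gaussian elimination written for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def naive_lasso_limit(sigma_xx, sigma_uu, beta0):
    """Large-sample limit of the naive estimate: Sigma_ww^{-1} Sigma_xx beta0,
    assuming Sigma_ww = Sigma_xx + Sigma_uu."""
    p = len(beta0)
    sigma_ww = [
        [sigma_xx[j][k] + sigma_uu[j][k] for k in range(p)] for j in range(p)
    ]
    rhs = [sum(sigma_xx[j][k] * beta0[k] for k in range(p)) for j in range(p)]
    return solve(sigma_ww, rhs)
```

With uncorrelated covariates of unit variance and measurement error variance 0.25, each coefficient is attenuated by the factor $1/1.25$, which is the familiar attenuation effect of classical measurement error.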
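A.4. Proof of Theorem 1

The basic inequality (Bühlmann and van de Geer, 2011) for the Lasso follows
directly from its definition (3):

$$\|y - W\hat\beta\|_2^2 + \lambda\|\hat\beta\|_1 \le \|y - W\beta^0\|_2^2 + \lambda\|\beta^0\|_1.$$

Reorganizing terms, we arrive at

$$\|W(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta\|_1 \le 2(\epsilon - U\beta^0)^{\top}W(\hat\beta - \beta^0) + \lambda\|\beta^0\|_1. \tag{28}$$

Under the noise bound (6), it is clear that

$$2(\epsilon - U\beta^0)^{\top}W(\hat\beta - \beta^0) \le 2\|(\epsilon - U\beta^0)^{\top}W\|_\infty \|\hat\beta - \beta^0\|_1 \le \lambda_0\|\hat\beta - \beta^0\|_1,$$

which, inserted into (28), yields

$$\|W(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta\|_1 \le \lambda_0\|\hat\beta - \beta^0\|_1 + \lambda\|\beta^0\|_1.$$

Now use the inequality

$$\|\hat\beta\|_1 \ge \|\beta_1^0\|_1 - \|\hat\beta_1 - \beta_1^0\|_1 + \|\hat\beta_2\|_1$$

and $\lambda \ge 2\lambda_0$, to obtain

$$2\|W(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta_2\|_1 \le 3\lambda\|\hat\beta_1 - \beta_1^0\|_1. \tag{29}$$

Inequality (29) shows that $\|\hat\beta_2\|_1 \le 3\|\hat\beta_1 - \beta_1^0\|_1$. That is, the vector $\hat\beta - \beta^0$ is
among the vectors to which the compatibility condition applies, for the index
set $L_0$. Next, use the equality

$$\|\hat\beta - \beta^0\|_1 = \|\hat\beta_1 - \beta_1^0\|_1 + \|\hat\beta_2\|_1$$

in (29), to obtain

$$2\|W(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \le 4\lambda\|\hat\beta_1 - \beta_1^0\|_1. \tag{30}$$

Since the compatibility condition holds for the set $L_0$, and $\hat\beta - \beta^0$ is in the set
of vectors for which the compatibility condition holds, we have

$$\|\hat\beta_1 - \beta_1^0\|_1 \le \frac{\sqrt{l_0}}{\phi_0}\|W(\hat\beta - \beta^0)\|_2.$$

Using this and the inequality $4uv \le u^2 + 4v^2$ in (30), we arrive at

$$\|W(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \le \frac{4\lambda^2 l_0}{\phi_0^2}. \tag{31}$$

The screening result follows immediately from (31), by noting that

$$\|\hat\beta - \beta^0\|_\infty \le \|\hat\beta - \beta^0\|_1 \le \frac{4\lambda l_0}{\phi_0^2}.$$

Thus, if $|\beta_j^0| > 4\lambda l_0/\phi_0^2$ for $j \in L_0$, then $\mathrm{sign}(\hat\beta_j) = \mathrm{sign}(\beta_j^0)$.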
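A.5. Proof of Theorem 2

We follow the structure of the proof by Zhao and Yu (2006, Proposition 1 and
Theorem 1), who proved the corresponding result in the absence of measurement
error. Consider the naive Lasso (25), and note that (Zhao and Yu, 2006)

$$\{\mathrm{sign}(\beta_1^0)\hat\gamma_1 > -|\beta_1^0|\} \Rightarrow \{\mathrm{sign}(\hat\beta_1) = \mathrm{sign}(\beta_1^0)\}$$

and

$$\hat\gamma_2 = 0 \Rightarrow \hat\beta_2 = 0.$$

Thus, by the KKT conditions for the naive Lasso (Lemma 2), if a solution $\hat\gamma$
exists, and the following three conditions hold:

$$-\frac{W_1^{\top}}{\sqrt{n}}\epsilon + \sqrt{n}S_{ww}^{1,1}\hat\gamma_1 + \sqrt{n}S_{wu}^{1,1}\beta_1^0 = -\frac{\lambda}{2\sqrt{n}}\mathrm{sign}(\beta_1^0) \tag{32}$$

$$|\hat\gamma_1| < |\beta_1^0| \tag{33}$$

$$\left| -\frac{W_2^{\top}}{\sqrt{n}}\epsilon + \sqrt{n}S_{ww}^{2,1}\hat\gamma_1 + \sqrt{n}S_{wu}^{2,1}\beta_1^0 \right| \le \frac{\lambda}{2\sqrt{n}}\mathbf{1}, \tag{34}$$

then $\mathrm{sign}(\hat\beta_1) = \mathrm{sign}(\beta_1^0)$ and $\mathrm{sign}(\hat\beta_2) = 0$.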

Event A implies the existence of {jil < \f3i\ such that 

-.1,1 N-i^i, /z:tci-i \-ic!i.i /aOi /- A I ^ 



|(S{^V)"'^e-^(Sww)"'S^u/3?l = l7il - -\iS^^)-hign{(3'i) 
But then, there must also exist $|\hat\gamma_1| < |\beta_1^0|$ such that

which essentially means choosing the appropriate sign of the elements of $\hat\gamma_1$.
Multiplying through by $S_{ww}^{1,1}$ and moving the first term on the right-hand side
over to the left-hand side, we get (32). Thus, A ensures that (32) and (33) are
satisfied. Next, adding and subtracting $\sqrt{n}S_{ww}^{2,1}\hat\gamma_1$ on the left-hand side of event
B, and then using the triangle inequality once, yields

I W2 r<S?-^ ~ ^ /aOi I o2,l 

+a/"S^V(S^w)"'S^u/3; + a/"S^'w7iI < ^ (1 - (?) 1. 

The second term on the left-hand side of this expression is the left-hand side of
(32) multiplied by $S_{ww}^{2,1}(S_{ww}^{1,1})^{-1}$. It can thus be replaced by the right-hand
side of (32) multiplied by the same factor, i.e.,

$$-\frac{\lambda}{2\sqrt{n}}S_{ww}^{2,1}(S_{ww}^{1,1})^{-1}\mathrm{sign}(\beta_1^0).$$

This yields

which implies, due to the IC-ME, 

V ^ ^ 

This is indeed (34). Altogether, A implies (32) and (33), while B\A implies (34). 
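For the asymptotic result, define the vectors

$$a^n = |\beta_1^0| - |(S_{ww,n}^{1,1})^{-1}S_{wu,n}^{1,1}\beta_1^0|$$

$$b^n = (S_{ww,n}^{1,1})^{-1}\mathrm{sign}(\beta_1^0)$$

$$z^n = (S_{ww,n}^{1,1})^{-1}\frac{W_1^{\top}}{\sqrt{n}}\epsilon$$

$$\zeta^n = \left( S_{ww,n}^{2,1}(S_{ww,n}^{1,1})^{-1}\frac{W_1^{\top}}{\sqrt{n}} - \frac{W_2^{\top}}{\sqrt{n}} \right)\epsilon$$

$$f^n = \left( S_{ww,n}^{2,1}(S_{ww,n}^{1,1})^{-1}S_{wu,n}^{1,1} - S_{wu,n}^{2,1} \right)\beta_1^0.$$

It follows that

$$P(A^c) \le \sum_{j=1}^{l_0} P\left( |z_j^n| \ge \sqrt{n}\left( a_j^n - \frac{\lambda_n}{2n}|b_j^n| \right) \right)$$

and

$$P(B^c) \le \sum_{i=1}^{p-l_0} P\left( |\zeta_i^n - \sqrt{n}f_i^n| \ge \frac{\lambda_n}{2\sqrt{n}}(1 - \theta) \right),$$

so

$$1 - P(A \cap B) \le P(A^c) + P(B^c) \le \sum_{j=1}^{l_0} P\left( |z_j^n| \ge \sqrt{n}\left( a_j^n - \frac{\lambda_n}{2n}|b_j^n| \right) \right) + \sum_{i=1}^{p-l_0} P\left( |\zeta_i^n - \sqrt{n}f_i^n| \ge \frac{\lambda_n}{2\sqrt{n}}(1 - \theta) \right).$$

It is clear that

$$z^n \xrightarrow{d} N\left( 0, \sigma^2(\Sigma_{ww}^{1,1})^{-1} \right), \quad \text{as } n \to \infty.$$

Hence, there exists a finite constant $s$, such that $E(z_j^n)^2 < s^2$ for $j = 1, \ldots, l_0$.
Next, we have by assumption,

$$a^n \to |\beta_1^0| - |(\Sigma_{ww}^{1,1})^{-1}\Sigma_{uu}^{1,1}\beta_1^0| > 0, \quad \text{as } n \to \infty,$$

and

$$b^n \to (\Sigma_{ww}^{1,1})^{-1}\mathrm{sign}(\beta_1^0), \quad \text{as } n \to \infty.$$

Now, using the assumption $\lambda_n/n = o(1)$, we get

$$P(A^c) \le \sum_{j=1}^{l_0} \left( 1 - P\left( |z_j^n| < \sqrt{n}a_j^n(1 + o(1)) \right) \right) \le (1 + o(1)) \sum_{j=1}^{l_0} \left( 1 - \Phi\left( \frac{\sqrt{n}}{s}a_j^n(1 + o(1)) \right) \right) = o(e^{-n^c}),$$

where we used the bound for the Gaussian tail probability,

$$1 - \Phi(t) \le t^{-1}\exp\left( -\tfrac{1}{2}t^2 \right). \tag{35}$$

Next, we note that

$$\zeta^n \xrightarrow{d} N\left( 0, \sigma^2\left( \Sigma_{ww}^{2,2} - \Sigma_{ww}^{2,1}(\Sigma_{ww}^{1,1})^{-1}\Sigma_{ww}^{1,2} \right) \right), \quad \text{as } n \to \infty.$$

Next, we consider $f^n$, and note that the limiting distribution of

$$S_{wu,n},$$

as $n \to \infty$, is normal with mean $\Sigma_{wu} = \Sigma_{uu}$, and finite variances (Anderson,
2003, Theorem 3.4.4). In addition,

$$S_{ww,n} \to \Sigma_{ww},$$

as $n \to \infty$. Thus, applying Slutsky's theorem to the product of the matrices,
the limiting distribution of

$$\sqrt{n}\left( S_{ww,n}^{2,1}(S_{ww,n}^{1,1})^{-1}S_{wu,n}^{1,1} - S_{wu,n}^{2,1} - \left( \Sigma_{ww}^{2,1}(\Sigma_{ww}^{1,1})^{-1}\Sigma_{uu}^{1,1} - \Sigma_{uu}^{2,1} \right) \right),$$

as $n \to \infty$, is normal with mean 0 and finite variances. The second term is
identically zero according to the MEC, so the limiting distribution of

$$\sqrt{n}\left( S_{ww,n}^{2,1}(S_{ww,n}^{1,1})^{-1}S_{wu,n}^{1,1} - S_{wu,n}^{2,1} \right),$$

as $n \to \infty$, also has mean zero and finite variances. Now

$$\sqrt{n}f^n = \sqrt{n}\left( S_{ww,n}^{2,1}(S_{ww,n}^{1,1})^{-1}S_{wu,n}^{1,1} - S_{wu,n}^{2,1} \right)\beta_1^0$$

is a vector in $\mathbb{R}^{p-l_0}$ whose elements are linear combinations of variables whose
limiting distribution as $n \to \infty$ is normal with mean zero and finite variances.
Accordingly, the limiting distribution of $\sqrt{n}f^n$, as $n \to \infty$, is normal with mean
zero and finite variances.

So, again there exists a finite constant $s$, such that $E(\zeta_i^n - \sqrt{n}f_i^n)^2 < s^2$ for
$i = 1, \ldots, (p - l_0)$. Thus, when $\lambda_n/n^{(1+c)/2} \to \infty$ for $c \in [0, 1)$, we have

$$P(B^c) \le \sum_{i=1}^{p-l_0} \left( 1 - P\left( |\zeta_i^n - \sqrt{n}f_i^n| < \frac{\lambda_n}{2\sqrt{n}}(1 - \theta) \right) \right) \le (1 + o(1)) \sum_{i=1}^{p-l_0} \left( 1 - \Phi\left( \frac{1}{s}\frac{\lambda_n}{2\sqrt{n}}(1 - \theta) \right) \right) = o(e^{-n^c}).$$

It follows that $P(A \cap B) = 1 - o(e^{-n^c})$.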

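A.6. Proof of Proposition 1

We consider now the Lasso with $\epsilon = 0$. In this case, $y = X\beta^0$, and the Lasso
becomes

$$\hat\beta = \underset{\beta}{\mathrm{argmin}} \left\{ \|W\beta - X\beta^0\|_2^2 + \lambda\|\beta\|_1 \right\}.$$

We follow the proof of Theorem 7.1 of Bühlmann and van de Geer (2011), but
also take measurement error into account.

Part 1

The KKT conditions take the form

$$2S_{ww}^{1,1}(\hat\beta_1 - \beta_1^0) + 2S_{ww}^{1,2}\hat\beta_2 + 2S_{wu}^{1,1}\beta_1^0 = -\frac{\lambda}{n}\tau_1 \tag{36}$$

$$2S_{ww}^{2,1}(\hat\beta_1 - \beta_1^0) + 2S_{ww}^{2,2}\hat\beta_2 + 2S_{wu}^{2,1}\beta_1^0 = -\frac{\lambda}{n}\tau_2, \tag{37}$$

where $\tau = (\tau_1^{\top}, \tau_2^{\top})^{\top}$ has the property that $\|\tau\|_\infty \le 1$ and $\tau_j = \mathrm{sign}(\hat\beta_j)$ if $\hat\beta_j \ne 0$.

We multiply (36) by $\hat\beta_2^{\top}S_{ww}^{2,1}(S_{ww}^{1,1})^{-1}$ and (37) by $\hat\beta_2^{\top}$, and then subtract the
first from the second, to get

$$2\hat\beta_2^{\top}\left( S_{ww}^{2,2} - S_{ww}^{2,1}(S_{ww}^{1,1})^{-1}S_{ww}^{1,2} \right)\hat\beta_2 + 2\hat\beta_2^{\top}\left( S_{wu}^{2,1} - S_{ww}^{2,1}(S_{ww}^{1,1})^{-1}S_{wu}^{1,1} \right)\beta_1^0 = \frac{\lambda}{n}\left( \hat\beta_2^{\top}S_{ww}^{2,1}(S_{ww}^{1,1})^{-1}\tau_1 - \hat\beta_2^{\top}\tau_2 \right). \tag{38}$$

The matrix term within the parentheses in the leftmost term is positive semidefinite,
since it is the Schur complement (Gentle, 2010) of the positive semidefinite
matrix $S_{ww}$, in which the part $S_{ww}^{1,1}$ is positive definite, since $l_0 < n$. Next,
the term within the parentheses on the right-hand side is

$$\hat\beta_2^{\top}S_{ww}^{2,1}(S_{ww}^{1,1})^{-1}\tau_1 - \|\hat\beta_2\|_1 \le \left( \|S_{ww}^{2,1}(S_{ww}^{1,1})^{-1}\tau_1\|_\infty - 1 \right)\|\hat\beta_2\|_1 \le 0.$$

The last inequality follows from the IC-ME, and is strict whenever $\|\hat\beta_2\|_1 \ne 0$.
Finally, the second term on the left-hand side of (38) is zero by assumption.
Thus, if $\|\hat\beta_2\|_1 \ne 0$, the left-hand side of (38) must be negative, which is a
contradiction. We thus conclude that $\hat\beta_2 = 0$, and the KKT conditions (36) and
(37) reduce to

$$2S_{ww}^{1,1}(\hat\beta_1 - \beta_1^0) + 2S_{wu}^{1,1}\beta_1^0 = -\frac{\lambda}{n}\tau_1 \tag{39}$$

$$2S_{ww}^{2,1}(\hat\beta_1 - \beta_1^0) + 2S_{wu}^{2,1}\beta_1^0 = -\frac{\lambda}{n}\tau_2. \tag{40}$$

From (39) we get

$$|\hat\beta_1 - \beta_1^0| \le \frac{\lambda}{2n}\sup_{\|\tau_1\|_\infty \le 1}\left| (S_{ww}^{1,1})^{-1}\tau_1 \right| + \left| (S_{ww}^{1,1})^{-1}S_{wu}^{1,1}\beta_1^0 \right|. \tag{41}$$

Now, if $j \in L_0^{\mathrm{det}}$ and $\hat\beta_j = 0$, then

$$|\hat\beta_j - \beta_j^0| = |\beta_j^0| > \frac{\lambda}{2n}\sup_{\|\tau_1\|_\infty \le 1}\left| \left( (S_{ww}^{1,1})^{-1}\tau_1 \right)_j \right| + \left| \left( (S_{ww}^{1,1})^{-1}S_{wu}^{1,1}\beta_1^0 \right)_j \right|,$$

which contradicts (41). Thus, $\hat\beta_j \ne 0$ for $j \in L_0^{\mathrm{det}}$.

Part 2

We start by assuming $\mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta^0)$. Thus, the KKT conditions are (39)
and (40). From (39) we get

$$\beta_1^0 - \hat\beta_1 = \frac{\lambda}{2n}(S_{ww}^{1,1})^{-1}\tau_1 + (S_{ww}^{1,1})^{-1}S_{wu}^{1,1}\beta_1^0.$$

Inserting this into (40) yields

$$S_{ww}^{2,1}(S_{ww}^{1,1})^{-1}\tau_1 + \frac{2n}{\lambda}\left( S_{ww}^{2,1}(S_{ww}^{1,1})^{-1}S_{wu}^{1,1} - S_{wu}^{2,1} \right)\beta_1^0 = \tau_2,$$

and the necessary condition stated in Proposition 1 follows by definition.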
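A.7. Proof of Corollary 1

The inverse of a $p \times p$ equicovariance matrix $\Sigma$, with diagonal 1 and off-diagonal
$\rho$, is given by

$$\Sigma^{-1} = \frac{1}{1 - \rho}\left( I_p - \frac{\rho}{1 + (p - 1)\rho}\mathbf{1}_{p \times p} \right),$$

where $I_p$ is the $p \times p$ identity matrix and $\mathbf{1}_{p \times p}$ is a $p \times p$ matrix of ones (Bühlmann
and van de Geer, 2011, p. 42). Here, we have

$$\Sigma_{ww}^{2,1} = (\sigma_x^2\rho_x + \sigma_u^2\rho_u)\,\mathbf{1}_{(p - l_0) \times l_0}$$

and

$$\Sigma_{ww}^{1,1} = (\sigma_x^2 + \sigma_u^2)\begin{pmatrix} 1 & \frac{\sigma_x^2\rho_x + \sigma_u^2\rho_u}{\sigma_x^2 + \sigma_u^2} & \cdots \\ \frac{\sigma_x^2\rho_x + \sigma_u^2\rho_u}{\sigma_x^2 + \sigma_u^2} & 1 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}.$$

Introducing the shorthands

$$a = \sigma_x^2 + \sigma_u^2 + (l_0 - 2)(\sigma_x^2\rho_x + \sigma_u^2\rho_u)$$

and

$$b = -(\sigma_x^2\rho_x + \sigma_u^2\rho_u),$$

we get

$$(\Sigma_{ww}^{1,1})^{-1} = \frac{1}{\left( \sigma_x^2 + \sigma_u^2 - (\sigma_x^2\rho_x + \sigma_u^2\rho_u) \right)\left( \sigma_x^2 + \sigma_u^2 + (l_0 - 1)(\sigma_x^2\rho_x + \sigma_u^2\rho_u) \right)}\begin{pmatrix} a & b & \cdots & b \\ b & a & \cdots & b \\ \vdots & & \ddots & \vdots \\ b & \cdots & b & a \end{pmatrix}$$

and

$$\Sigma_{ww}^{2,1}(\Sigma_{ww}^{1,1})^{-1} = \frac{\sigma_x^2\rho_x + \sigma_u^2\rho_u}{\sigma_x^2 + \sigma_u^2 + (l_0 - 1)(\sigma_x^2\rho_x + \sigma_u^2\rho_u)}\,\mathbf{1}_{(p - l_0) \times l_0}.$$

Hence, for arbitrary $\beta_1^0$,

$$\left| \Sigma_{ww}^{2,1}(\Sigma_{ww}^{1,1})^{-1}\mathrm{sign}(\beta_1^0) \right| = \frac{\sigma_x^2\rho_x + \sigma_u^2\rho_u}{\sigma_x^2 + \sigma_u^2 + (l_0 - 1)(\sigma_x^2\rho_x + \sigma_u^2\rho_u)}\left| \sum_{j=1}^{l_0}\mathrm{sign}(\beta_j^0) \right|\mathbf{1}_{p - l_0} \le \frac{l_0(\sigma_x^2\rho_x + \sigma_u^2\rho_u)}{\sigma_x^2 + \sigma_u^2 + (l_0 - 1)(\sigma_x^2\rho_x + \sigma_u^2\rho_u)}\,\mathbf{1}_{p - l_0},$$

where $\mathbf{1}_{p - l_0} \in \mathbb{R}^{p - l_0}$ is a vector of ones. The result follows immediately.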
A.8. Proof of Corollary 2

We follow the proof of Zhao and Yu (2006, Corollary 5), who give the corresponding
result in the absence of measurement error. The block matrix inversion
formula (Gentle, 2010, pp. 95-96) yields



■r2,1 



WW,l-' 



/ B^V^(B^V^)-isign(/3?,i) 

V B^V,fc(B^w,fc)-'sign(/3;^,) 
and the result follows immediately. 



sign(/3;) 



-"WW,*: / \ y-"'W^,kl 

j2,l 



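A.9. Proof of Theorem 3

Starting from the KKT conditions of Lemma 3, we will redo the steps of the
proof of Theorem 2, but with the insertion of extra terms representing the
correction for measurement error. The corrected Lasso is not in general convex,
and our analysis will thus concern any critical point $\hat\gamma = \hat\beta - \beta^0$ within the
constraint region $\|\hat\gamma + \beta^0\|_1 \le R$.

If $\hat\gamma$ exists, and satisfies the following three conditions:

$$-\frac{W_1^{\top}}{\sqrt{n}}\epsilon + \sqrt{n}(S_{ww}^{1,1} - \Sigma_{uu}^{1,1})\hat\gamma_1 + \sqrt{n}(S_{wu}^{1,1} - \Sigma_{uu}^{1,1})\beta_1^0 = -\frac{\lambda}{2\sqrt{n}}\mathrm{sign}(\beta_1^0) \tag{42}$$

$$|\hat\gamma_1| < |\beta_1^0| \tag{43}$$

$$\left| -\frac{W_2^{\top}}{\sqrt{n}}\epsilon + \sqrt{n}(S_{ww}^{2,1} - \Sigma_{uu}^{2,1})\hat\gamma_1 + \sqrt{n}(S_{wu}^{2,1} - \Sigma_{uu}^{2,1})\beta_1^0 \right| \le \frac{\lambda}{2\sqrt{n}}\mathbf{1}, \tag{44}$$

then $\mathrm{sign}(\hat\beta_1) = \mathrm{sign}(\beta_1^0)$ and $\mathrm{sign}(\hat\beta_2) = 0$.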

Event A (15) implies the existence of {jil < |/3°| such that 

K^ww ^ -^uu) ^^=^ ~ V"(Sww ~ ^uu) ^(^wu ^ ^uu)/3il = 



/n 



l7i|-^l(S^V-Sijij)-isign(/30)f 
But then, there must also exist $|\hat\gamma_1| < |\beta_1^0|$ such that



I'^ww ^uuJ ~ V "-l^ww ~ ^uu-l I'^WU ~ ^UU-IPl 



^ (^1 - ^(^ww - Siju)"'sign(/3;; 

which essentially means choosing the appropriate sign of the elements of $\hat\gamma_1$.
Multiplying through by $(S_{ww}^{1,1} - \Sigma_{uu}^{1,1})$ and moving the first term on the right-hand
side over to the left-hand side, we get (42). Thus, A ensures that (42) and
(43) are satisfied. Next, adding and subtracting $\sqrt{n}(S_{ww}^{2,1} - \Sigma_{uu}^{2,1})\hat\gamma_1$ on the
left-hand side of event B, and then using the triangle inequality once, yields



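\[
\left| -\frac{1}{\sqrt{n}} W_2^{\top}\varepsilon + \sqrt{n}\left(S_{ww}^{2,1} - \Sigma_{uu}^{2,1}\right)\gamma_1 + \sqrt{n}\left(S_{wu}^{2,1} - \Sigma_{uu}^{2,1}\right)\beta_1^0 \right|
- \left| \left(S_{ww}^{2,1} - \Sigma_{uu}^{2,1}\right)\left(S_{ww}^{1,1} - \Sigma_{uu}^{1,1}\right)^{-1}\left( -\frac{1}{\sqrt{n}} W_1^{\top}\varepsilon + \sqrt{n}\left(S_{ww}^{1,1} - \Sigma_{uu}^{1,1}\right)\gamma_1 + \sqrt{n}\left(S_{wu}^{1,1} - \Sigma_{uu}^{1,1}\right)\beta_1^0 \right) \right|
\le \frac{\lambda}{2\sqrt{n}}(1-\theta)\,\mathbf{1}.
\]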



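The second term on the left-hand side of this expression is the left-hand side of (42) multiplied by $(S_{ww}^{2,1} - \Sigma_{uu}^{2,1})(S_{ww}^{1,1} - \Sigma_{uu}^{1,1})^{-1}$. It can thus be replaced by the right-hand side of (42) multiplied by the same factor, i.e.,

\[
-\frac{\lambda}{2\sqrt{n}}\left(S_{ww}^{2,1} - \Sigma_{uu}^{2,1}\right)\left(S_{ww}^{1,1} - \Sigma_{uu}^{1,1}\right)^{-1}\operatorname{sign}(\beta_1^0).
\]

This yields

\[
\left| -\frac{1}{\sqrt{n}} W_2^{\top}\varepsilon + \sqrt{n}\left(S_{ww}^{2,1} - \Sigma_{uu}^{2,1}\right)\gamma_1 + \sqrt{n}\left(S_{wu}^{2,1} - \Sigma_{uu}^{2,1}\right)\beta_1^0 \right|
- \frac{\lambda}{2\sqrt{n}}\left| \left(S_{ww}^{2,1} - \Sigma_{uu}^{2,1}\right)\left(S_{ww}^{1,1} - \Sigma_{uu}^{1,1}\right)^{-1}\operatorname{sign}(\beta_1^0) \right|
\le \frac{\lambda}{2\sqrt{n}}(1-\theta)\,\mathbf{1},
\]

which implies, due to the IC-CL,

\[
\left| -\frac{1}{\sqrt{n}} W_2^{\top}\varepsilon + \sqrt{n}\left(S_{ww}^{2,1} - \Sigma_{uu}^{2,1}\right)\gamma_1 + \sqrt{n}\left(S_{wu}^{2,1} - \Sigma_{uu}^{2,1}\right)\beta_1^0 \right| \le \frac{\lambda}{2\sqrt{n}}\,\mathbf{1}.
\]

This is indeed (44). Altogether, $A$ implies (42) and (43), while $B \mid A$ implies (44).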
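The algebra behind this step can be checked on simulated data. The sketch below is an illustration under stated assumptions: the sample sizes, penalty level, covariances and coefficients are hypothetical; $\gamma_1$ is obtained by solving (42) directly; and the vectors `zeta` and `f` follow the quantities $\zeta^n$ and $f^n$ used in the asymptotic part of this proof, with the sign convention for `zeta` being an assumption. The check confirms that the expression inside (44) equals $\zeta^n - \sqrt{n} f^n$ plus the left-hand side of (42) multiplied by $(S_{ww}^{2,1} - \Sigma_{uu}^{2,1})(S_{ww}^{1,1} - \Sigma_{uu}^{1,1})^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, l0 = 200, 6, 2             # hypothetical sizes
lam = 5.0                        # hypothetical penalty level
sigma_uu = 0.2 * np.eye(p)       # assumed known measurement error covariance
beta1 = np.array([2.0, -1.5])    # nonzero part of beta^0; the other entries are zero

X = rng.standard_normal((n, p))                   # true covariates
U = np.sqrt(0.2) * rng.standard_normal((n, p))    # measurement error
W = X + U                                         # observed covariates
eps = rng.standard_normal(n)                      # regression noise

root_n = np.sqrt(n)
S_ww = W.T @ W / n
S_wu = W.T @ U / n
A11 = (S_ww - sigma_uu)[:l0, :l0]
A21 = (S_ww - sigma_uu)[l0:, :l0]
C11 = (S_wu - sigma_uu)[:l0, :l0]
C21 = (S_wu - sigma_uu)[l0:, :l0]
w1 = W[:, :l0].T @ eps / root_n
w2 = W[:, l0:].T @ eps / root_n
s1 = np.sign(beta1)

# Solve (42) for gamma_1.
gamma1 = np.linalg.solve(A11, w1 - root_n * C11 @ beta1 - lam / (2 * root_n) * s1) / root_n

lhs42 = -w1 + root_n * A11 @ gamma1 + root_n * C11 @ beta1     # left-hand side of (42)
inner44 = -w2 + root_n * A21 @ gamma1 + root_n * C21 @ beta1   # expression inside (44)

M = A21 @ np.linalg.inv(A11)
zeta = M @ w1 - w2                  # assumed definition of zeta^n
f = (M @ C11 - C21) @ beta1         # f^n as used later in the proof

print(np.allclose(lhs42, -lam / (2 * root_n) * s1))
print(np.allclose(inner44, zeta - root_n * f + M @ lhs42))
```

Both checks are exact identities, so they hold for any seed for which `A11` is invertible; whether (43) and (44) themselves hold depends on the simulated data and on `lam`.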






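For the asymptotic result, define the vectors

\[
z^n = \left(S_{ww,n}^{1,1} - \Sigma_{uu}^{1,1}\right)^{-1}\frac{1}{\sqrt{n}} W_1^{\top}\varepsilon,
\]

\[
a^n = |\beta_1^0| - \left| \left(S_{ww,n}^{1,1} - \Sigma_{uu}^{1,1}\right)^{-1}\left(S_{wu,n}^{1,1} - \Sigma_{uu}^{1,1}\right)\beta_1^0 \right|,
\]

\[
b^n = \left(S_{ww,n}^{1,1} - \Sigma_{uu}^{1,1}\right)^{-1}\operatorname{sign}(\beta_1^0),
\]

\[
\zeta^n = \left(S_{ww,n}^{2,1} - \Sigma_{uu}^{2,1}\right)\left(S_{ww,n}^{1,1} - \Sigma_{uu}^{1,1}\right)^{-1}\frac{1}{\sqrt{n}} W_1^{\top}\varepsilon - \frac{1}{\sqrt{n}} W_2^{\top}\varepsilon,
\]

\[
f^n = \left( \left(S_{ww,n}^{2,1} - \Sigma_{uu}^{2,1}\right)\left(S_{ww,n}^{1,1} - \Sigma_{uu}^{1,1}\right)^{-1}\left(S_{wu,n}^{1,1} - \Sigma_{uu}^{1,1}\right) - \left(S_{wu,n}^{2,1} - \Sigma_{uu}^{2,1}\right) \right)\beta_1^0.
\]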

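It follows that

\[
P(A^c) \le \sum_{j=1}^{l_0} P\left( |z_j^n| \ge \sqrt{n}\left( a_j^n - \frac{\lambda_n}{2n}|b_j^n| \right) \right)
\]

and

\[
P(B^c) \le \sum_{j=1}^{p-l_0} P\left( \left| \zeta_j^n - \sqrt{n} f_j^n \right| \ge \frac{\lambda_n}{2\sqrt{n}}(1-\theta) \right),
\]

so

\[
1 - P(A \cap B) \le P(A^c) + P(B^c)
\]

is bounded by the sum of these two expressions.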

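It is clear that

\[
z^n \xrightarrow{d} N\left(0, \sigma^2\left(\Sigma_{xx}^{1,1}\right)^{-1}\Sigma_{ww}^{1,1}\left(\Sigma_{xx}^{1,1}\right)^{-1}\right), \quad \text{as } n \to \infty.
\]

Hence, there exists a finite constant $s$, such that $E(z_j^n)^2 < s^2$ for $j = 1, \ldots, l_0$.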

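Next, we have by assumption,

\[
a^n \xrightarrow{p} |\beta_1^0|, \quad \text{as } n \to \infty,
\]

and

\[
b^n \xrightarrow{p} \left(\Sigma_{xx}^{1,1}\right)^{-1}\operatorname{sign}(\beta_1^0), \quad \text{as } n \to \infty.
\]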
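Now, using the assumption $\lambda_n/n = o(1)$, we get

\[
P(A^c) \le \sum_{j=1}^{l_0}\left( 1 - P\left( \frac{|z_j^n|}{s} < \frac{\sqrt{n}}{s}\,a_j^n\,(1 + o(1)) \right) \right)
\le (1 + o(1))\sum_{j=1}^{l_0}\left( 1 - \Phi\left( \frac{\sqrt{n}}{s}\,a_j^n\,(1 + o(1)) \right) \right)
= o\left(e^{-n^c}\right),
\]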
where we used the bound (35). Next, we note that

\[
\zeta^n \xrightarrow{d} N\left(0, \sigma^2\left(\Sigma_{ww}^{2,2} - \Sigma_{xx}^{2,1}\left(\Sigma_{xx}^{1,1}\right)^{-1}\Sigma_{xx}^{1,2}\right)\right), \quad \text{as } n \to \infty.
\]
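Next, we consider $f^n$, and note that the limiting distribution of

\[
\sqrt{n}\left(S_{wu,n} - \Sigma_{uu}\right),
\]

as $n \to \infty$, is normal with mean zero and finite variances (Anderson, 2003, Theorem 3.4.4). In addition,

\[
S_{ww,n} - \Sigma_{uu} \xrightarrow{p} \Sigma_{xx},
\]

as $n \to \infty$. Thus, applying Slutsky's theorem to the product of the matrices, the limiting distribution of

\[
\sqrt{n}\left( \left(S_{ww,n}^{2,1} - \Sigma_{uu}^{2,1}\right)\left(S_{ww,n}^{1,1} - \Sigma_{uu}^{1,1}\right)^{-1}\left(S_{wu,n}^{1,1} - \Sigma_{uu}^{1,1}\right) - \left(S_{wu,n}^{2,1} - \Sigma_{uu}^{2,1}\right) \right),
\]

as $n \to \infty$, is normal with mean zero and finite variances. Now,

\[
\sqrt{n} f^n = \sqrt{n}\left( \left(S_{ww,n}^{2,1} - \Sigma_{uu}^{2,1}\right)\left(S_{ww,n}^{1,1} - \Sigma_{uu}^{1,1}\right)^{-1}\left(S_{wu,n}^{1,1} - \Sigma_{uu}^{1,1}\right) - \left(S_{wu,n}^{2,1} - \Sigma_{uu}^{2,1}\right) \right)\beta_1^0
\]

is a vector in $\mathbb{R}^{p-l_0}$ whose elements are linear combinations of variables whose limiting distribution as $n \to \infty$ is normal with mean zero and finite variances. Accordingly, the limiting distribution of $\sqrt{n} f^n$, as $n \to \infty$, is normal with mean zero and finite variances.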

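So, again there exists a finite constant $s$, such that $E(\zeta_j^n - \sqrt{n} f_j^n)^2 < s^2$ for $j = 1, \ldots, (p - l_0)$, and when $\lambda_n/n^{(1+c)/2} \to \infty$ for $c \in [0, 1)$, we have

\[
P(B^c) \le \sum_{j=1}^{p-l_0}\left( 1 - P\left( \frac{|\zeta_j^n - \sqrt{n} f_j^n|}{s} < \frac{1}{s}\frac{\lambda_n}{2\sqrt{n}}(1-\theta) \right) \right)
\le (1 + o(1))\sum_{j=1}^{p-l_0}\left( 1 - \Phi\left( \frac{1}{s}\frac{\lambda_n}{2\sqrt{n}}(1-\theta) \right) \right)
= o\left(e^{-n^c}\right).
\]

It follows that $P(A \cap B) = 1 - o(e^{-n^c})$.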
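The exponential rate can be made concrete with a small numerical sketch. It assumes the standard Gaussian tail inequality $1 - \Phi(t) \le \phi(t)/t$ for $t > 0$, which may differ in form from the bound (35) used above, and it uses hypothetical values for $a_j^n$, $s$ and $c$. The functions `normal_upper_tail` and `mills_bound` are illustrative helpers, not part of the paper.

```python
import math

def normal_upper_tail(t):
    """1 - Phi(t) for the standard normal distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def mills_bound(t):
    """Upper bound phi(t) / t on 1 - Phi(t), valid for t > 0."""
    return math.exp(-0.5 * t * t) / (t * math.sqrt(2.0 * math.pi))

a, s, c = 1.0, 1.0, 0.5   # hypothetical limit of a_j^n, constant s, and rate c in [0, 1)

sample_sizes = (25, 100, 400)
tails, bounds, ratios = [], [], []
for n in sample_sizes:
    t = math.sqrt(n) * a / s               # argument of Phi in the bound on P(A^c)
    tails.append(normal_upper_tail(t))
    bounds.append(mills_bound(t))
    ratios.append(tails[-1] / math.exp(-n ** c))   # tail relative to exp(-n^c)

for n, tail, ratio in zip(sample_sizes, tails, ratios):
    print(n, tail, ratio)
```

The printed ratios shrink rapidly with $n$, which is the sense in which a sum of finitely many such tail terms is $o(e^{-n^c})$ for $c < 1$.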